REPORT  DOCUMENTATION  PAGE 


Form  Approved 
OMB  No.  0704-0188 


The  public  reporting  burden  for  this  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information,  including 
suggestions  for  reducing  the  burden,  to  the  Department  of  Defense,  Executive  Service  Directorate  (0704-0188).  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no 
person  shall  be  subject  to  any  penalty  for  failing  to  comply  with  a  collection  of  information  if  it  does  not  display  a  currently  valid  OMB  control  number. 

PLEASE  DO  NOT  RETURN  YOUR  FORM  TO  THE  ABOVE  ORGANIZATION. 


1.  REPORT  DATE  (DD-MM-YYYY) 

29-02-2012 


2.  REPORT  TYPE 

Final  performance  report 


4.  TITLE  AND  SUBTITLE 

Joint  Information  Theoretic  and  Differential  Geometrical  Approach  for  Robust 
Automated  Target  Recognition 


3.  DATES  COVERED  (From  -  To) 

March  2009  -Feb.  2012 


5a.  CONTRACT  NUMBER 


5b.  GRANT  NUMBER 

FA9550-09-1-0132 


5c.  PROGRAM  ELEMENT  NUMBER 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Department  of  Electrical  &  Computer  Engineering 

University  of  Florida 
Gainesville,  Florida  3261 1-6130 


5d.  PROJECT  NUMBER 

00075874 


5e.  TASK  NUMBER 


5f.  WORK  UNIT  NUMBER 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

USAF,  AFRL 

AF  OFFICE  OF  SCIENTIFIC  RESEARCH 
875  N.  RANDOLPH  ST.  ROOM  31 12 
ARLINGTON,  VA  22203 


12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Distribution  unlimited 


10.  SPONSOR/MONITOR'S  ACRONYM(S) 


11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

AFRL-OSR-VA-TR-201 2-0357 


14.  ABSTRACT 

The  overall  objective  of  this  project  is  to  develop  transformative  theory  and  algorithms  for  robust  Automated  Target  Recognition  (ATR).  This 
project  addressed  the  following  challenging  problems  in  ATR:  modeling  uncertainty,  small  sample  size,  high  dimensional  data,  irrelevant 
features/dimensions,  heterogeneous  data,  and  outliers.  In  this  project,  the  PI  proposed  and  developed  the  following  new  techniques: 

1)  kernel  local  feature  extraction  (KLFE)  for  ATR  applications,  2)  technique  for  identifying  network  dynamics  under  sparsity  and  stationarity 
constraints,  3)  self-organized-queue-based  ( SOQ)  clustering  scheme,  4)  robust  principal  component  analysis  fRPCA)  based  on  manifold 
optimization,  outlier  detection,  and  subspace  decomposition. 


15.  SUBJECT  TERMS 

Feature  extraction,  clustering,  identification  of  network  dynamics,  robust  principal  component  analysis 


16.  SECURITY  CLASSIFICATION  OF:  | 

a.  REPORT 

b.  ABSTRACT 

c.  THIS  PAGE 

u 

u 

u 

17.  LIMITATION  OF 
ABSTRACT 


18.  NUMBER  19a.  NAME  OF  RESPONSIBLE  PERSON 

Sges  DaPengWu _ 

19b.  TELEPHONE  NUMBER  (Includearea  code) 

62  352-392-4954 


Standard  Form  298  (Rev.  8/98) 

Prescribed  by  ANSI  Std.  Z39.18 

Adobe  Professional  7.0 


Air  Force  Office  of  Scientific  Research  Grant  FA9550-09- 1-0132: 


Joint  Information  Theoretic  and  Differential  Geometrical 
Approach  for  Robust  Automated  Target  Recognition 

(Performance  Period:  March  2009  -  February  2012) 

Final  Report 

PI:  Dapeng  “Oliver”  Wu 
Department  of  Electrical  and  Computer  Engineering 
University  of  Florida 

431  Engineering  Building,  P.O.Box  116130 
Gainesville,  FL  32611-6130 
Tel:  (352)  392-4954,  Fax:  (352)  392-0044 
Email:  wu@ece.ufl.edu,  Web:  http://www.wu.ece.ufl.edu 


February  29,  2012 


Contents 


1  Summary  1 

2  Kernel-based  Feature  Extraction  under  Maximum  Margin  Criterion  2 

2.1  Introduction .  3 

2.2  Related  Work .  5 

2.3  Review  of  RELIEF  and  LFE .  6 

2.3.1  Nearest  Hit  (NH)  and  Nearest  Miss  (NM) .  7 

2.3.2  RELIEF .  7 

2.3.3  LFE .  8 

2.4  Kernel  LFE .  9 

2.4.1  Nonlinear  LFE  in  High-Dimensional  Space .  9 

2.4.2  Basis  Rotation  Invariant  Property  of  KLFE .  11 

2.4.3  KLFE  using  KPCA .  16 

2.4.4  KLFE  Algorithm .  18 

2.5  Experimental  Results .  19 

2.5.1  Experimental  Setting .  19 

2.5.2  Experimental  Results  for  Simulated  Data .  21 

2.5.3  Experimental  Results  for  Real-World  Datasets  .  23 

2.6  Conclusion  .  24 

3  Identification  of  Network  Dynamics  under  Sparsity  and  Stationarity  Constraints  26 

3.1  Introduction .  26 

3.2  The  VAR  Model  and  Problem  Formulation .  29 

3.2.1  Conditional  Maximum  Likelihood  Estimation  of  VAR .  29 


3.2.2  Penalized  Estimation  and  the  Berhu  Penalty 


30 


3.2.3  Sparse  and  Stationary  Estimation  of  the  VAR  Model .  32 

3.2.4  Equivalent  Formulations .  34 

3.3  The  BIPS  Framework .  37 

3.3.1  The  Thresholding  Rule  for  Berhu  .  37 

3.3.2  The  BIPS  Algorithm .  39 

3.3.3  Thresholding-based  Iterative  Screening .  41 

3.3.4  Tuning  Strategy .  43 

3.4  Bootstrap  Enhanced  Learning .  44 

3.4.1  The  BE-BIPS  Framework  .  44 

3.4.2  Stationary  Bootstrap .  45 

3.5  Experiment .  46 

3.5.1  Performance  Measures .  46 

3.5.2  Experiment  Settings  .  46 

3.5.3  Performance  of  BIPS .  47 

3.5.4  Performance  of  TIS .  48 

3.6  Application  to  U.S.  Macroeconomic  Data .  49 

3.6.1  Comparison  of  Rolling  MSE .  50 

3.6.2  Bootstrap  Analysis .  51 

3.7  Conclusion  .  53 

4  Conclusions  53 

List  of  Figures 

1  A  typical  pattern  classification  system .  3 

2  USPS  handwritten  digits  3(top  row)  and  5(bottom  row) .  20 

3  Simulated  data  set  containing  sine  surfaces .  22 

ii 


4  Simulated  data  set  containing  Swiss  roll .  23 

5  Classification  error  rate  vs.  target  feature  dimension  of  simulated  data .  24 

6  Classification  error  rate  vs.  target  feature  dimension  on  Swiss  Roll .  25 

7  Classification  error  rate  vs.  target  feature  dimension  of  UCI  data .  25 

8  Classification  error  rate  vs.  target  feature  dimension  on  usps  3  vs.  5 . 25 

9  Example  of  Huber  and  Berhu .  32 

10  Example  of  stationary  and  nonstationary  VAR  processes.  The  number  of  nodes  is 

p  =  50.  The  stationary  VAR  process  has  p(A)  =  0.95,  and  the  nonstationary  VAR 
process  has  p(A)  =  1.05 .  33 

11  Penalty  functions  and  corresponding  thresholding  rules.  A  =  0.2,  M  =  1.3 . 38 

12  Comparison  of  Patterns  and  Sample  Paths.  Top:  pattern  of  A  and  observations 

from  the  true  model;  Middle:  pattern  of  Abips  and  sample  path  from  the  corre¬ 
sponding  stationary  model;  Bottom:  pattern  of  AP lasso  and  sample  path  from  the 
corresponding  nonstationary  model .  49 

13  Topology  of  the  macroeconomic  network  in  the  pre-Great  Moderation  period.  ...  52 

14  Topology  of  the  macroeconomic  network  in  the  post-Great  Moderation  period.  .  .  52 

List  of  Tables 

1  UCI  and  USPS  data  sets  used  in  the  experiments .  20 

2  Performance  comparison  of  Lasso,  Berhu,  and  BIPS .  48 

3  Pmiss  of  TIS  and  SIS .  48 

4  Performance  of  TIS .  50 

5  Normalized  Rolling  MSE  of  Lasso  and  BIPS  for  each  category .  51 

6  Rolling  MSE  of  Lasso  and  BIPS  for  different  horizons  .  51 

iii 


1  Summary 


The  overall  objective  of  this  project  is  to  develop  transformative  theory  and  algorithms  for  robust 
Automated  Target  Recognition  (ATR).  This  project  addressed  the  following  challenging  problems 
in  ATR:  modeling  uncertainty,  small  sample  size,  high  dimensional  data,  irrelevant  features/di¬ 
mensions,  heterogeneous  data,  and  outliers.  In  this  project,  the  PI  proposed  and  developed  the 
following  new  techniques: 

•  Kernel-based  feature  extraction  under  maximum  margin  criterion:  We  developed  a  tech¬ 
nique  called  kernel  local  feature  extraction  (KLFE)  for  ATR  applications.  Compared  with 
other  feature  extraction  algorithms,  KLFE  enjoys  nice  properties  such  as  low  computational 
complexity,  and  high  probability  of  identifying  relevant  features;  this  is  because  KLFE  is  a 
nonlinear  wrapper  feature  extraction  method  and  consists  of  solving  a  simple  convex  opti¬ 
mization  problem.  The  experimental  results  have  shown  the  superiority  of  KLFE  over  the 
existing  algorithms. 

•  Identification  of  network  dynamics  under  sparsity  and  stationarity  constraints:  The  sparse 
vector  autoregressive  model  (VAR)  is  commonly  used  for  modeling  dynamic  networks,  such 
as  tank  movements,  troops  movements,  brain  functional  networks,  stock  markets  and  social 
networks.  A  penalized  linear  regression  was  proposed  to  identify  the  autoregressive  coeffi¬ 
cient  matrices  with  sparsity  constraint.  However,  though  the  VAR  model  is  assumed  to  be 
stationary,  this  property  is  never  taken  into  consideration  by  the  penalized  linear  regression. 
Moreover,  the  present  techniques  for  estimating  a  VAR  model  are  only  applicable  to  the 
problems  with  relatively  low  dimensionality  and  large  number  of  observations.  The  main 
purpose  of  this  work  is  to  tackle  these  challenging  issues.  We  formulate  the  problem  as  pe¬ 
nalized  linear  regression  with  stationarity  constraint,  and  propose  the  Berhu  iterative  sparsity 
pursuit  with  stationarity  constraint  (BIPS)  to  solve  the  problem  efficiently.  Berhu  is  a  novel 
scheme  with  hybrid  penalty  that  improves  the  Lasso  scheme  for  high  collinearity  problem. 
We  also  implement  the  screening  technique  into  BIPS  for  dealing  with  the  “large  p  small  n” 
problem.  A  bootstrap  enhanced  learning  procedure  is  applied  to  approximate  the  probability 
of  existence  for  each  connection.  Experiments  show  that  our  method  guarantees  a  stationary 
estimate,  outperforms  Lasso  in  estimation  accuracy,  and  works  well  for  high-dimensional 
problems. 

•  Self-organized-queue-based  (SOQ)  clustering:  In  this  work,  we  take  a  bio-inspired  approach 
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to  the  graph  clustering  problem  and  enable  fictitious  queues  with  self-organizing  capability 
to  cluster  nodes  in  a  graph.  Our  SOQ  clustering  scheme  is  one  type  of  kernel  clustering 
schemes.  Experimental  results  have  demonstrated  the  superiority  of  our  SOQ  scheme  over 
the  existing  schemes  such  as  K-means,  kernel  K-means,  spectral  clustering  and  normalized 
cuts. 

•  Robust  principal  component  analysis  (RPCA):  Our  RPCA  scheme  is  based  on  manifold  op¬ 
timization,  outlier  detection,  and  subspace  decomposition.  Experimental  results  have  shown 
that  our  RPCA  scheme  significantly  outperforms  PCA  when  there  are  outliers  in  the  data; 
our  RPCA  scheme  also  performs  much  better  than  the  existing  robust  PCA  schemes  under 
the  condition  that  both  Gaussian  noise  and  outliers  exist  in  the  data. 

Publications  during  the  reporting  period: 

•  J.  Wang,  J.  Fan,  H.  Li,  D.  Wu,  “Kernel-based  Feature  Extraction  under  Maximum  Margin 
Criterion,”  Journal  of  Visual  Communication  and  Image  Representation,  vol.  23,  no.  1,  pp. 
53-62,  January  2012. 

•  Y.  He,  Y.  She,  D.  Wu,  ’’Identification  of  Stationary  Sparse  Vector  Autoregressive  Model,”  to 
be  submitted  to  Journal  of  Machine  Learning  Research. 

•  Y.  She,  S.  Li,  D.  Wu,  ’’Manifold-Optimization-Based  Robust  Principal  Component  Analy¬ 
sis,”  to  be  submitted  to  IEEE  Transactions  on  Pattern  Analysis  and  Machine  Intelligence. 

•  B.  Sun,  D.  Wu,  ’’Self-Organized-Queue-Based  Clustering:  a  Bio-inspired  Approach,”  to  be 
submitted  to  Nature. 


2  Kernel-based  Feature  Extraction  under  Maximum  Margin 
Criterion 

In  this  work,  we  study  the  problem  of  feature  extraction  for  pattern  classification  applications.  RE¬ 
LIEF  is  considered  as  one  of  the  best-performed  algorithms  for  assessing  the  quality  of  features 
for  pattern  classification.  Its  extension,  local  feature  extraction  (LFE),  was  proposed  recently  and 
was  shown  to  outperform  RELIEF.  In  this  paper,  we  extend  LFE  to  the  nonlinear  case,  and  develop 
a  new  algorithm  called  kernel  LFE  (KLFE).  Compared  with  other  feature  extraction  algorithms, 
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Figure  1 :  A  typical  pattern  classification  system. 

KLFE  enjoys  nice  properties  such  as  low  computational  complexity,  and  high  probability  of  identi¬ 
fying  relevant  features;  this  is  because  KLFE  is  a  nonlinear  wrapper  feature  extraction  method  and 
consists  of  solving  a  simple  convex  optimization  problem.  The  experimental  results  have  shown 
the  superiority  of  KLFE  over  the  existing  algorithms.  Next,  we  present  the  technical  details. 

2.1  Introduction 

In  this  paper,  we  study  the  problem  of  feature  extraction  for  pattern  classification  applications.  As 
shown  in  Fig.  1,  a  typical  pattern  classification  system  consists  of  two  parts:  one  for  the  training 
phase  and  one  for  the  classification  phase.  In  the  training  phase,  the  system  is  given  a  training  data 
set, 


V  =  {(xn,yn)}^=1cXxy,  (1) 

where  N  is  the  number  of  samples  in  the  training  data  set,  X  C  R1  is  the  /-dimensional  fea¬ 
ture  space,  and  y  =  {±1}  1  is  the  label  space.  At  the  end  of  the  training  phase,  the  system 
obtains  a  parameter  set,  which  provides  information  needed  by  the  feature  extraction  module  and 
the  classifier  in  the  classification  phase.  Although  in  many  papers,  the  feature  extraction  module 
for  a  classification  system  is  not  explicitly  specified,  it  is  a  significant  component.  A  good  feature 
extraction  technique  not  only  reduces  system  complexity  and  processing  time,  but  also  improves 
classification  accuracy  by  eliminating  irrelevant  features. 

'in  this  paper,  we  focus  on  two-category  pattern  classification  problem. 
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In  this  paper,  we  propose  a  feature  extraction  algorithm,  called  kernel  local  feature  extraction 
(KLFE),  which  has  low  computational  complexity,  and  achieves  high  probability  of  identifying 
irrelevant  features  for  removal  (dimension  reduction). 

To  begin  with,  we  define  feature  extraction  as  a  mapping 

f  :X  CR'^I'C  R1',  (2) 

which  maps  patterns  in  input  feature  space  X  to  output  feature  space  X',  in  order  to  optimize  some 
pre-defined  criterion.  Usually,  we  have  I'  f  I  so  that  features  are  mapped  from  a  high-dimensional 
space  to  a  lower  one,  which  reduces  system  complexity. 

A  feature  extraction  method  can  be  categorized  according  to  the  following  criteria. 

First,  a  feature  extraction  method  can  be  linear  or  nonlinear.  A  linear  feature  extraction  method 
has  the  form 


/(x)  =  X  •  x,  x  G  X :  (3) 

where  A  is  a  /'  x  /  matrix.  Otherwise,  it  is  nonlinear.  Usually  nonlinear  methods  outperform 
linear  ones  as  nonlinear  methods  are  able  to  capture  the  true  pattern,  which  is  usually  nonlinear.  In 
this  paper,  we  propose  a  nonlinear  feature  extraction  method,  called  KLFE,  which  has  the  form  of 

/(x)  =  A  •  <^(x) ,  (4) 

where  ip  :  X  C  R1  — *  X  C  R1  is  a  nonlinear  function  and  A  is  an  /'  x  I  matrix. 

Second,  feature  extraction  has  two  special  cases,  namely,  feature  selection  and  feature  weight¬ 
ing.  If  A  in  Eqs.  (3)  and  (4)  is  a  diagonal  matrix  whose  diagonal  elements  are  restricted  to  either 
0  or  1,  the  feature  extraction  method  is  also  called  feature  selection ;  if  the  diagonal  elements  of 
diagonal  matrix  A  can  take  any  real-valued  number  between  0  and  1,  the  feature  extraction  method 
is  also  called  feature  weighting. 

Allowing  diagonal  elements  of  A  to  take  real- valued  numbers,  instead  of  binary  ones,  enables 
feature  weighting  to  employ  some  well-established  optimization  techniques  and  thus  allows  for 
more  efficient  algorithm  implementation  than  feature  selection.  A  celebrated  feature  weighting 
method  is  the  REFIEF  algorithm  [1]. 

One  major  shortcoming  of  feature  selection  and  feature  weighting  is  their  inability  to  remove 
correlation  among  different  feature  dimensions  so  as  to  achieve  a  sparser  representation  of  data 
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samples  [2].  In  some  applications,  such  as  object  recognition,  where  there  is  no  need  to  preserve 
the  physical  meaning  of  individual  features,  feature  extraction  is  more  appropriate  than  feature 
selection  and  feature  weighting.  For  example,  Sun  and  Wu  [3]  [4]  proposed  a  linear  feature  extrac¬ 
tion  method,  called  local  feature  extraction  (LFE),  which  outperforms  feature  selection  and  feature 
weighting.  In  this  paper,  we  extend  LFE  to  a  nonlinear  one  called  Kernel  LFE. 

Third,  a  feature  extraction  method  can  be  further  categorized  as  a  wrapper  method  or  a  filter 
method  [5].  A  wrapper  method  determines  the  mapping  in  Eq.  (2)  by  minimizing  the  classification 
error  rate  of  a  classifier,  whereas  a  filter  method  does  not.  Therefore,  filter  methods  are  compu¬ 
tationally  more  efficient  but  usually  perform  worse  than  wrapper  methods.  The  famous  principal 
component  analysis  (PCA)  [6,  page  115]  is  a  filter  method,  whereas  RELIEF  [1]  and  LFE  [3]  are 
wrapper  methods,  as  they  both  optimize  a  1 -nearest-neighbor  (1-NN)  classifier  [6,  page  174].  Our 
KLFE  algorithm  is  also  a  wrapper  method  in  that  it  optimizes  a  1-NN  classifier  in  the  nonlinear- 
transformed  space. 

Last,  a  feature  extraction  method  can  be  obtained  by  solving  a  convex  optimization  problem 
or  a  non-convex  optimization  problem.  A  convex  optimization  problem  formulation  is  preferred 
since  it  can  be  solved  efficiently,  compared  to  a  non-convex  optimization  problem.  PCA,  RELIEF, 
and  LFE  are  all  based  on  convex  optimization  formulation.  Our  KLFE  algorithm  is  also  based  on 
a  convex  optimization  formulation,  which  admits  a  closed-form  solution. 

In  summary,  in  this  paper,  we  propose  a  nonlinear  wrapper  feature  extraction  method,  which  is 
based  on  a  convex  optimization  formulation. 

The  remainder  of  the  section  is  organized  as  follows.  In  Section  2.2,  related  work  is  briefly 
reviewed.  In  Section  2.3,  we  describe  two  existing  feature  extraction  techniques,  namely,  RE¬ 
LIEF  and  LFE;  our  proposed  KLFE  is  a  generalization  of  these  two  algorithms.  In  Section  2.4, 
we  present  a  novel  feature  extraction  algorithm,  called  KLFE.  Section  3.5  presents  experimen¬ 
tal  results,  which  demonstrate  that  KLFE  outperforms  the  existing  feature  extraction  algorithms. 
Section  3.7  concludes  this  work. 

2.2  Related  Work 

In  this  section,  we  briefly  review  some  feature  extraction  algorithms,  which  will  be  compared  to  our 
KLFE  algorithm.  Principal  component  analysis  (PCA)  [6,  page  115]  is  probably  one  of  the  most 
commonly  used  algorithms  for  feature  extraction.  One  major  drawback  of  PCA,  however,  is  that  it 
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is  targeted  at  minimizing  mean  squared  error  for  data  compression  or  efficient  data  representation, 
rather  than  minimizing  the  classification  error  probability  for  pattern  classification.  Other  PCA- 
type  algorithms,  e.g.,  kernel  PCA  (KPCA)  [7],  usually  perform  better  than  PCA  in  representing 
nonlinear  relationship  among  different  feature  dimensions,  but  suffer  from  the  same  limitation  as 
PCA  since  they  do  not  use  class  labels  in  the  training  phase;  i.e.,  they  are  unsupervised  algorithms. 
The  KLFE  algorithm  proposed  in  this  paper,  like  its  predecessors  LFE  and  RELIEF,  utilizes  the 
class  label  information  in  the  training  phase;  i.e.,  KLFE  is  an  supervised  algorithm. 

Among  the  existing  feature  weighting  techniques,  RELIEF  [1]  is  considered  as  one  of  the  best- 
performed  ones  due  to  its  simplicity  and  effectiveness  [8].  RELIEF  determines  the  parameters 
of  diagonal  matrix  A  in  Eq.  (3)  by  solving  a  convex  optimization  problem,  which  maximizes  a 
margin-based  criterion  [9].  The  LFE  algorithm,  proposed  in  Ref.  [3],  is  an  extension  to  RELIEF; 
LFE  removes  the  constraint  of  matrix  A  in  Eq.  (3)  being  diagonal,  which  is  required  by  RELIEF. 
Both  LFE  and  RELIEF  are  linear  methods.  In  contrast,  our  KLFE  method  is  a  nonlinear  extension 
to  LFE;  experimental  results  show  that  KLFE  performs  better  than  LFE. 

In  Ref.  [10],  the  authors  extended  RELIEF  to  a  kernel  space,  which  is  the  space  that  contains 
the  image  of  the  nonlinear-transformation  used  in  a  kernel  method;  their  approach  is  to  identify 
an  orthonormal  basis  of  the  kernel  space  and  perform  RELIEF  in  this  kernel  space;  the  resulting 
schemes  are  called  Feature  Space  KPCA  (FSKPCA)  and  Feature  Space  Kernel  Gram-Schmidt 
Process  (FSKGP).  They  showed  that  FSKPCA  and  FSKGP  achieve  similar  performance  to  that  of 
the  state-of-the-art  algorithms.  Our  KLFE  adopts  a  similar  strategy,  i.e.,  our  KLFE  first  computes 
an  orthonormal  basis  of  the  kernel  space  and  then  performs  LFE  in  the  kernel  space.  Since  LFE 
achieves  improved  performance  over  RELIEF,  KLFE  is  expected  to  outperform  FSKPCA  and 
FSKGP,  which  are  kernel-based  versions  of  RELIEF. 

In  the  next  section,  we  briefly  review  RELIEF  and  LFE  before  we  present  our  KLFE  in  Sec¬ 
tion  2.4. 

2.3  Review  of  RELIEF  and  LFE 

In  this  section,  we  briefly  review  RELIEF  and  LFE  since  LFE  is  an  extension  of  RELIEF  and 
KLFE  is  an  extension  of  LFE.  We  first  define  two  terms,  nearest  hit  (NH)  and  nearest  miss  (NM) 
in  Section  2.3.1,  which  will  be  used  in  all  of  the  three  algorithms,  i.e.,  RELIEF,  LFE,  and  KLFE. 
Then  we  introduce  RELIEF  and  LFE  in  Sections  2.3.2  and  2.3.3,  respectively. 
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2.3.1  Nearest  Hit  (NH)  and  Nearest  Miss  (NM) 


Suppose  we  are  given  a  training  data  set,  as  shown  in  Eq.  (1).  For  any  pattern  (x,  y)  6  V,  we 
define  its  nearest  hit  (NH)  as 


NH(x,y)  —  argminx/  ||x'  —  x||p  (5) 

s.t.  (x',7/')  e  V,  (6) 

y'  =  y ,  (7) 

and  its  nearest  miss  (NM)  as 

NM (x,  y)  —  arg  minx/  |  |x'  —  x| \p  (8) 

s.t.  (x',7/')  e  V,  (9) 

y'  ±  y,  (10) 


where  |  |x|  \p  is  Lp-norm  of  vector  x.  In  this  paper,  we  let  p  —  1,  i.e.,  we  choose  L 1  norm. 

Using  Eqs.  (5)  and  (8),  we  further  denote 

mn  =  xn  -  NM (xn,  yn),  (1 1) 

hn  =  xn  -  NH(yini  yn),  (12) 


for  n  —  1, ... ,  N. 


2.3.2  RELIEF 

Now  we  briefly  introduce  RELIEF.  Denote  by  w  =  [uq, . . . ,  wi]7  the  weight  vector,  where  ivt  is 
the  weight  of  the  z-th  dimension  of  x„  6  R1.  RELIEF  defines  the  margin  of  a  pattern  (xn,  yn)  e  V 
as 

pn=  ||mn||i  -  1 1 h„. Ill,  n  =  1, . . .  ,N.  (13) 

Then  the  objective  of  RELIEF  is  to  maximize  the  overall  margin  over  weight  vector,  i.e., 

max  En=l  Pn( w)  =  En= 1  lmnl  “  ^  lh™l)  >  (14a) 

w 

s.t.  ||w|||  =  l,w  ^  0,  (14b) 
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where  |  •  |  denotes  element-wise  absolute  operator.  Let 


we  simplify  Eq.  (14)  to 


z  =  Eli(|mn|-|hn|), 


max  w 1  z,  s.t.  || w|||  =  l,w  ^  0. 


(15) 


(16) 


Applying  Lagrangian  multipliers  A  and  6  to  Eq.  (16),  one  obtains 

L—  — wTz  +  A  ( 1 1  w 1 1  |  —  1)  +  wT0.  (17) 

Taking  derivative  with  respect  to  w  at  both  sides  of  Eq.  (17)  and  setting  it  to  zero  results  in 


=  -z  +  2Aw  -9  =  0 
=>  w  =  ^(z  +  0)- 


The  closed-form  solution  to  Eq.  (14)  is  [9] 


(z) 

W  =  — — 


z)+||2  : 


(18) 


(19) 


where  (z)+  =  [(z\)  + , . . . ,  (zi)+\r  and  (■)+  =  max(-,0).  The  RELIEF  algorithm  specifies  the 
projection  matrix 


A  = 


w\  0 

0  wj 


2.3.3  LFE 

A  natural  extension  of  RELIEF  is  to  use  a  full  matrix  instead  of  a  diagonal  matrix,  which  results 
in  LFE  [3].  In  LFE,  the  following  optimization  problem  is  considered: 

R$X  En= 1  Pn(W)  =  En= 1  m^Wmn  -  J2n=l  hn  Whn,  (20a) 

S.t.  j|w|||  =  1,  W  ^  0,  (20b) 

where  ||W||,p  is  the  Frobenius  norm  of  W,  i.e.,  ||  W||  w  lj=V52i  Xb  where  (Mi=i  are 

the  eigenvalues  of  W. 

Sun  and  Wu  [3]  proved  Theorem  1,  which  provides  a  solution  to  Eq.  (20). 
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Theorem  1.  Let 


Smh=  X,  (2D 

and  let  { (Vx,  ,  a,)}[=|  be  the  eigen-system  of  Em/,  such  that  o\  ^  cr2  ^  ^  07.  The  solution  for 

Eq.  (20),  up  to  /Ac  difference  of  a  constant,  is 

w  =  ■wf.  (22) 

According  to  Theorem  1,  the  LFE  algorithm  produces  a  projection  matrix  by  maintaining  the 
dimensions  specified  by  eigenvectors  (a which  correspond  to  the  largest  I'  eigenvalues  of 
Emj,  where  I'  f  I  is  the  target  dimension  size  as  defined  in  Eq.  (3)  and  /'  should  be  chosen  such 
that  <7i  ^  ^  or  ^  0;  in  other  words,  LFE  is  defined  by 

/(x)  =  *4x,  x  G  X,  (23) 

where 

A=  [v/or(ai,...,^077a//]T.  (24) 


2.4  Kernel  LFE 

In  this  section,  we  propose  KLFE  algorithm.  This  section  is  organized  as  follows.  In  Section  2.4.1, 
we  present  the  extension  of  LFE  to  a  high-dimensional  space  by  introducing  a  nonlinear  mapping. 
In  Section  2.4.2,  we  prove  that  both  LFE  and  KLFE  are  basis  rotation  invariant  and  thus  KLFE  can 
be  considered  to  perform  LFE  under  an  orthonormal  basis  in  the  kernel  space.  In  Section  2.4.3,  we 
present  an  algorithm  for  implementing  KLFE  by  using  KPCA  to  find  an  orthonormal  basis  in  the 
kernel  space.  We  summarize  KLFE  algorithm  and  describe  the  implementation  details  in  Section 
2.4.4.  Computational  complexity  is  also  analyzed  in  this  section. 

2.4.1  Nonlinear  LFE  in  High-Dimensional  Space 

As  presented  in  Section  2.3.3,  the  LFE  algorithm  is  a  linear  feature  extraction  method  as  in  Eq.  (3). 
Our  idea  of  extending  LFE  is  to  introduce  a  nonlinear  function, 

<p  :  X  C  R1  —>  X  C  R1,  (25) 
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where  usually  /  ^>  /,  which  maps  patterns  from  a  low-dimensional  space  to  a  high-dimensional 
one.  We  call  X  or  WJ  kernel  space ,  which  contains  the  image  of  p.  Then  we  apply  the  LFE 
algorithm  to  kernel  space  X;  the  resulting  algorithm  is  called  KLFE.  We  further  assume  I  P>  N, 
as  is  always  the  case  when  using  a  kernel  method.  Similar  to  Eqs.  (11)  and  (12),  we  let 


mn  =  p(xn)  -  p  (. NM{xn ,  yn)) ,  (26) 

h„  =  p(xn)  -  p  (NH(xn,  yn)) ,  (27) 

n  —  1, . . . ,  N.  KLFE  can  be  obtained  by  solving  the  following  optimization  problem, 

max  J2n=i  mr;  W mn  -  Yln= i  (28a) 

w 

s.t.  ||W|||  =  1,W^0.  (28b) 


Since  the  only  difference  between  Eq.  (28)  and  Eq.  (20)  is  the  use  of  mapping  p,  one  can 
directly  use  Theorem  1  to  solve  Eq.  (28).  Corollary  1.1  summarizes  the  solution  to  Eq.  (28). 

Corollary  1.1.  Let 


ZmH  =  Etl  -  En=i  hnh^,  (29) 

and  let  {(cr,;,  a,;)}[= ,  be  the  eigen-system  ofY>mh,  such  that  d\  ^  ^  ^  dp  The  solution  to 

Eq.  (28),  up  to  the  difference  of  a  constant,  is 

W=  ^ 

{i:a-i>0} 


From  Corollary  1.1,  the  KLFE  algorithm  produces  a  projection  matrix  by  maintaining  the  di¬ 
mensions  specified  by  eigenvectors  {d*}^,  which  correspond  to  the  largest  I'  eigenvalues  of  Em/) 
where  I'  ^  /  is  the  target  dimension  size  as  defined  in  Eq.  (3)  and  I'  should  be  chosen  such  that 
(j\  f  f  ai>  f  0\'m  other  words,  KLFE  is  defined  by 

/(x)  =  Ap  (x) ,  x  e  X ,  (30) 

where 

A=  [v'dlai,. . . ,  v/d^a//]  .  (31) 
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2.4.2  Basis  Rotation  Invariant  Property  of  KLFE 

In  this  section,  we  study  an  important  property  of  KLFE:  basis  rotation  invariance.  Before  we 
show  the  basis  rotation  invariant  property  of  KLFE  in  Proposition  1,  we  present  Lemma  1. 

Lemma  1  states  that  LFE  is  basis  rotation  invariant. 

Lemma  1.  Let  (ej?  'l}[=1  and  {e^1  }[=]  be  two  different  orthonormal  bases  in  input  feature  space 

M7.  Assume  that  {e^ can  be  obtained  by  counterclockwise  rotating  {e(l?7}[=1,  and  the  rotation 
matrix  is  denoted  by  Q.  Then,  LFE  is  basis  rotation  invariant,  i.  e. ,  a  feature  vector  extracted  by 
LFE  under  (e|')}7=1  is  the  same  as  the  feature  vector  extracted  by  LFE  under  {e^  }[=] . 

Proof  Assume  training  samples  are  given  under  basis  {e^}7=1,  i.e., 

V  =  {(xn,yn)}^=1cXxy, 


where  X  C  M7  is  the  /-dimensional  feature  space  and  y  =  {±1}.  Under  basis  {e^}7=1,  LFE  is 
formulated  as  follows: 


max 

Wi 


Eti  Pn(Wr)  =  Eti  mn  w imn  -  Etr  KWiK, 


N 


S.t. 


Willi  =  l,w1^0. 


(32a) 

(32b) 


From  Theorem  1,  the  projection  matrix  of  LFE  under  basis  (e,  }(=1  is  given  by 

«4i  =  [s/df&i,  •  •  • ,  v/oy7a//]T  ,  (33) 

where  V  ^  /  is  the  target  dimension  size  and  { a, }  are  the  eigenvectors  of  Em/,  corresponding  to 
the  largest  I '  eigenvalues. 

Under  basis  {e^  }7=1,  the  training  samples  become 

V  =  {(Qxn,yn)} 


The  nearest  miss  and  nearest  hit  remain  the  same  for  each  data  sample  since  the  Euclidean  dis¬ 
tance  does  not  change  under  different  orthonormal  bases.  Under  basis  (e^  }j=  i,  LFE  is  formulated 
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as  below: 


max  Yfn=\  Pn( W2)  =  En=l (Qm/W2Qm„  -  En=l  (QV 
w2 

)TW2Qh„,  (34a) 

s.t.  ||w2|||  =  1,W2  ^  0. 

(34b) 

Comparing  (32)  and  (34),  we  have  Wi  =  QTW2Q ■  Then  we  have 

w2  =  QWxQt 

(35) 

=  Q(  ^  (Tiaia[)QT 

(36) 

{v.(Ji>  0} 

=  ^  o’iQa-i.af QT 

(37) 

0} 

where  (a)  is  due  to  (22).  Let  the  projection  matrix  of  LFE  under  basis  {e^ 
(37),  (22),  and  (24),  we  have 

\l=1,  be  A2 .  Then,  from 

A2  =  {^o\Qai,...,y/oAQaI>}T  (38) 

=  AxQt  (39) 

where  (a)  is  due  to  (33).  Then,  the  feature  vector  extracted  by  LFE  under  {ej/  }(=1  is  given  by 

A2Q*  =  AiQtQx  (40) 

=  Aix  (41) 

where  (a)  is  due  to  (39);  (b)  is  due  to  the  fact  that  Q  is  an  orthogonal  matrix.  Eq.  (41)  means 
that  a  feature  vector  extracted  by  LFE  under  (e^}f=1,  i.e.,  Alx  is  the  same  as  the  feature  vector 
extracted  by  LFE  under  {e^  }f=1,  i.e.,  A2Qx..  This  completes  the  proof.  □ 

Usually  we  do  not  know  the  dimension  of  the  kernel  subspace  that  contains  the  mapped  data 
samples  (including  training  and  test  data  samples),  but  given  sufficient  number  of  training  samples, 
we  can  estimate  the  dimension  of  this  kernel  subspace.  Assume  the  rank  of  the  mapped  training 
data  samples  {</?(xn)}Ei  is  Nt,  i.e.,  the  mapped  training  data  samples  are  contained  in  an  Nt- 
dimensional  kernel  subspace  denoted  by  S.  Suppose  the  kernel  space  X  has  an  orthonormal  basis 

(eWK=i- 

Proposition  1  states  that  KLFE  is  basis  rotation  invariant  for  bases  in  the  kernel  space. 
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Proposition  1.  Let  {e^}(=1  and  { o ; } _f= ,  be  two  orthonormal  bases  in  kernel  space.  Assume 

that  {e|/,}{=l  can  be  obtained  by  counterclockwise  rotating  {ej?*}[=1,  and  the  rotation  matrix  is 
denoted  by  Q.  Then,  KLFE  is  basis  rotation  invariant  for  all  samples  x  where  p(x)  G  S,  i.e.,  a 

feature  vector  extracted  by  KLFE  under  {e^}f=1  for  input  sample  x  is  the  same  as  the  feature 
vector  extracted  by  KLFE  under  {e^  }f=1. 

Proof  Assume  the  training  sample  set  is 

^'{(Xn,rf=1C^X^ 

where  X  C  M7  is  the  /-dimensional  feature  space  and  y  =  {±1}. 

In  KLFE,  a  sample  is  first  mapped  from  a  low-dimensional  space  to  a  high-dimensional  space 
by  the  following  nonlinear  transformation 

p  :  A  C  M7  -*•  X  C  MJ  (42) 

The  training  sample  set  in  the  kernel  space  under  basis  (e[')}f=1  is  denoted  by 

Vl  =  {(P(xn),yn)}n=l  (43) 

Under  basis  {e2^}f=1,  the  training  sample  set  in  the  kernel  space  becomes 

£>2  =  {(Qp(Xn),yn)}n=l 

Denote  the  projection  matrix  of  KLFE  in  (30)  under  basis  (e[t')}f=1  and  {e^}^  by  A\  and 
A2,  respectively.  Note  that  the  dimension  of  the  feature  vector  extracted  by  KLFE  is  F,  where 
F  <  I.  Since  KLFE  is  equivalent  to  applying  LFE  in  the  kernel  space,  from  Lemma  1,  we  have 

Aip{x)  =  A2Qp{x)  (44) 

This  completes  the  proof.  □ 

Proposition  2.  Let  (e^}f=1  be  an  orthonormal  basis  in  kernel  space.  Denote  by  S  the  kernel 

subspace  spanned  by  training  data  {93 (x,,  )  }^,;  the  dimension  of  S  is  Nt.  Then  an  I -dimensional 
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feature  vector  f(x)  extracted  by  KLFE  under  {e  ^  }[=,  for  input  sample  x  (where  <p(x)  E  S)  must 
be  in  the  form  of 


f{x) 


°i-Nt,  1 


(45) 


where  f\  (x)  is  an  Nt-dimensional  vector  and  Oj_Nt  l  is  an  (/  —  Nt) -dimensional  vector  whose 
entries  are  all  zero.  In  other  words,  the  last  I  —  Nt  entries  of  the  extracted  feature  vector  f(x)  are 
all  zero. 


Proof.  Assume  the  training  sample  set  is 

V  =  {(xn,yn)}^=1cXxy, 


where  X  C  M7  is  the  /-dimensional  feature  space  and  y  =  {±1}. 

After  nonlinear  mapping,  the  training  sample  set  in  the  kernel  space  under  basis  {e^}f=1 
becomes 

£>1  =  {Mxn),2/n)}JLi  (46) 


Let  {e^}^  be  an  orthonormal  basis  for  S.  Denote  by  S  the  complementary  subspace  of  S. 

Let  (e^ ')}f=7Vt+1  be  an  orthonormal  basis  for  S  .  Thus  {e^}^  U{e3'>}f=jvt+i  f°rm  an  orthonor¬ 
mal  basis  for  kernel  space. 

From  Proposition  1 ,  the  feature  vector  extracted  by  KLFE  algorithm  under  basis  |e^' :  }[=1  is  the 
same  as  that  extracted  by  KLFE  under  a  rotated  basis  {e^}^  U(e3^}f=ivt+i-  So  we  can  compute 
the  extracted  feature  vector  under  basis  {e;/  }^  (J{e3  }f=ivt+i  *or  simplicity. 

Since  <p(x)  E  S,  hence  the  last  I—Nt  entries  of  the  coordinates  of  p(x)  under  basis  {e^}^  U(e3  ^ 
are  all  zero.  From  the  definition  of  Emh  in  Eq.  (29),  we  have 


E 


mh 


z*mh  o 

0  0 


(47) 


where  Em/,  is  an  I  x  I  matrix,  and  E*)i(  is  an  Nt  x  Nt  matrix. 

Let  ^denote  the  projection  matrix  of  KLFE  in  (30)  under  basis  {e^}^  U  ie3^}*=w  +  i  •  From 
Corollary  1.1,  Eq.  (30)  and  Eq.  (47),  we  have 


A 


Ai  0 

0  0 


(48) 


i=Nt+ 1 
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where 


Ai  —  [\/aiai, . . . ,  -\/aNt^Nt]  , 
and  {(ex,,  aj)}^1  is  the  eigen-system  of  S *nh . 

From  (30),  the  extracted  feature  vector  f(x )  is  given  by 


(49) 


f(x)  =  A(p{x) 

o 


(50) 


(51) 


where 


(52) 


This  completes  the  proof. 


□ 


It  is  worth  mentioning  that  if  <p(x)  ^  S,  then  the  distance  d(<p(x),  <p'(x))  is  negligible  where 
tp'(x)  is  the  projection  of  <p(x)  onto  S.  The  reason  is  that  if  the  training  data  ,  and  the 

test  data  x  are  sampled  from  the  same  distribution,  |  |<^(x)  —  <p'(x)  1 1  is  mainly  caused  by  irrelevant 
features  or  measurement  noise  [10]. 

From  Proposition  2,  we  can  perform  feature  extraction  in  kernel  subspace  S  in  which  the  basis 
can  be  expressed  by  linear  combinations  of  mapped  data  samples  in  the  kernel  space.  In  this  way, 
we  can  simplify  the  computation  involved  in  KLFE.  Proposition  2  shows  that  KLFE  can  extract  at 
most  Nt  dimensional  nonzero  feature  vector  for  arbitrary  input  sample  which  lies  in  S. 

Based  on  the  two  propositions,  KLFE  can  be  computed  in  three  steps.  First,  we  find  a  basis  in 
kernel  subspace.  This  can  be  done  by  using  KPCA  or  Kernel  Gram-Schmidt  Procedure  (KGP)  [10]; 
the  dimension  of  the  basis  is  equal  to  the  rank  of  the  mapped  data  set  in  kernel  space.  Second,  the 
data  in  the  kernel  space  can  be  mapped  onto  the  basis,  each  basis  vector  of  which  is  a  linear 
combination  of  the  mapped  data  {^(x)},  i.e.,  Note  that  the  kernel  method 

can  be  used  to  obtain  the  kernel  feature  under  the  basis.  Third,  we  perform  LFE  on  the  resulting 
kernel  features  which  have  dimension  of  Nt. 

Next,  we  present  the  final  KLFE  algorithm  by  using  KPCA  to  find  a  basis  in  kernel  subspace. 
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2.4.3  KLFE  using  KPCA 


For  a  given  (p,  its  kernel  function,  /C  :  X  x  X  — ►  M,  is  defined  as 

/C  (x1;  x2)  =  (<p(xi),  <^(x2)) ,  (53) 

where  <  v  >  represents  inner-product  operator.  It  is  known  that  /C  and  ip  have  1-to-l  mapping 
[11].  In  other  words,  we  can  ignore  the  explicit  form  of  p  by  using  a  given  /C  directly,  as  long  as 
all  computations  are  conducted  through  inner  product. 

Without  loss  of  generality,  assume  the  average  of  the  data  samples  in  kernel  space  is  zero.  Let 

K  =  X'  X  be  the  kernel  matrix,  where  matrix  X  =  [^(x, ),  </?(x2),  •  •  •  ,  ^(x;Y)]-  Hence  the  entry 
of  Lth  row,  j-th  column  in  K  is  given  by 

Kid  =<  </?(xj),(/?(xj)  >=  /C(x;,x,,) ,  i  —  1, . . . ,  N;  j  —  1, . . . ,  N.  (54) 


Let  {7n,Vn}^=i  be  the  eigen-system  of  K,  where  the  eigenvalues  are  sorted  in  decreasing 
order,  i.e.,  71  ^  72  ^  ^  7at.  Then,  by  definition  of  eigenvalue  decomposition, 


XTXvn  =  7„vn, 

(55) 

XXT  (Xvn)  =  7n  (Xv„)  , 

(56) 

for  n  =  1, . . . ,  N.  As  a  result,  {7n}^=1  are  the  N  largest  eigenvalues  of  XXT,  whose  correspond¬ 
ing  eigenvectors  are  {Xvra  .  Denote  the  dimension  of  matrix  K  by  N'.  If  N'  =  N,  then  7,  >  0 
for  i  —  1,  2,  ■  ■  ■  ,  N.  If  N’  <  N,  we  perform  LFE  in  the  kernel  subspace  of  dimension  N'. 

r  -  n 

Normalizing  eigenvectors  |Xvn  j'  produces  an  orthonormal  basis 


\Er  = 


X 


Vi  VjV 

x/Tl’  '  '  '  ’  y/lN 


=  XV. 


(57) 


Thus  for  input  vector  xn  (n  =  1, . . . ,  X),  the  feature  vector  extracted  by  KLFE  under  the  basis  in 
(57)  is  given  by 


x„  = 


XTp(ptn) 


V,TXT^(xn)  =  Y'TK^n\ 


(58) 
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where  A' 60  denotes  the  n- th  column  of  kernel  matrix  K.  More  generally,  for  any  x  G  X,  we  have 


x=  V 


IT 


/C(xi,x) 


=  ^(x)- 


/C(xat,x) 

Eq.  (59)  actually  specifies  an  iV-dimensional  kernel  space, 

X  =  (</5(x)  :  x  G  X} 


(59) 


(60) 


We  summarize  KPCA-based  KLFE  algorithm  as  follows.  Let 

m„  =  ip(xn)  -  (p  (. NM (xn,  yn)) ,  (61) 

h„  =  (p{xn)  -  (p  (NH(xn,  yn)) ,  (62) 

n  =  1, . . . ,  N.  The  KLFE  algorithm  solves  the  following  optimization  problem, 

max  E«=i  rnIWm„  -  E«=i  h^Whn,  (63a) 

w 

s.t.  ||W|||  =  1,W  7>  0.  (63b) 


Using  the  result  in  Theorem  1,  the  solution  to  Eq.  (63)  is  given  in  Theorem  2. 

Theorem  2.  Let 

^mh  =  En=l  -  En= 1  hn^  (64) 

and  let  { (d,.  A)  ]E  ,  be  the  eigen-system  of  such  that  d\  f  ■■■  f  <7 at.  The  solution  to 
Eq.  (63),  up  to  the  difference  of  a  constant,  is 

W  ^  ^  Gn<in£ ln  . 

{n:<7n>  0} 


Accordingly,  the  projection  matrix  is 


A—  [y/ofai, . . . ,  v/cua//] 


For  an  input  x,  the  extracted  feature  is  given  by 


/(x)  =  Ap(x)  =  .4.V' 


IT 


/C(xi,x) 


/C(xN,x) 


where  V'  A  defined  in  Eq.  (57). 


(65) 


(66) 
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KLFE  is  superior  to  LFE  in  that  it  performs  LFE  in  a  high-dimensional  space,  where  discrim¬ 
inant  information  is  much  easier  to  extract.  From  the  above  analysis,  KLFE  can  be  considered  as 
KPCA  followed  by  LFE.  Therefore,  KLFE  is  superior  over  KPCA  since  KLFE  takes  into  account 
the  label  information;  KLFE  also  outperforms  kernel  RELIEF,  i.e.,  FSKPCA  and  FSKGP  [10], 
where  FSKPCA  is  KPCA  followed  by  RELIEF. 

2.4.4  KLFE  Algorithm 

Now  the  pseudo-code  of  KLFE  is  shown.  In  the  initialization  step,  we  need  some  parameters,  like 
the  number  of  neighbors  for  computing  Em/, .  Assuming  that  we  use  the  RBF  kernel  and  K-nearest- 
neighbors  as  classifier,  we  need  the  width  of  RBF  kernel  and  the  number  of  neighbors  for  KNN.  In 
our  experiments,  we  use  10-fold  cross  validation  to  find  these  parameters. 

If  we  use  complex  tuning  method  to  find  better  parameter  set,  the  performance  of  KLFE  will  be 
improved.  In  this  paper,  we  do  not  focus  on  complex  tuning  method  or  classification  method. 
Therefore  we  use  simple  tuning  method:  10-fold  cross  validation  to  tune  all  parameters  needed. 
For  each  parameter,  we  use  only  5-10  candidate  points. 

The  complexity  of  KLFE  depends  on  the  kernel  function.  For  example,  consider  the  radial 
basis  function  (RBF)  kernel  [11,  page  77],  which  is  given  by 


(67) 


The  complexity  of  computing  kernel  matrix,  i.e.,  Eq.  (54),  is  O  ( N2I ).  The  complexity  of  eigen¬ 
value  decomposition  for  kernel  matrix  and  the  complexity  of  LFE  in  the  iV-dimensional  kernel 
space  are  both  O  ( N 3).  As  a  result,  the  overall  complexity  of  KLFE  using  RBF  kernel  is 


o  ( N2I )  +  O  (iV3)  . 


(68) 


It  is  comparable  to  the  complexity  of  LFE,  which  is  also  O  ( N2I )  +  O  ( N 3)  [3].  To  reduce  the 
computational  complexity  of  KLFE,  we  can  use  Kernel  Gram-Schmidt  Procedure  [10]  to  find  a 
basis  instead  of  using  KPCA. 
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Algorithm  KLFE 


Input:  Training  samples  X  =  [x\ . . .  xN]  and  labels  Y  =  [yi. .  .yN\ 

1)  Initialization 

Normalize  X,  give  kernel  parameter,  number  of  neighbors  L. 

2)  Mapping  to  kernel  space 

2.1)  K  =  kernel(X),  K  is  the  kernel  of  A". 

2.2)  [V,  D]  =  EigenDecomposition(K), 

U’s  column  contains  one  principal  component,  D  is  a  diagonal  matrix  with  eigenvalues. 
All  the  zero  values  are  removed. 


3) 


2.3) 

LFE 

3.1) 


A  =  DVtK 
for  n  —  1  :  N 

Q  is  the  ith  nearest  Xi  labeled  the  same  class  with  x 
r]i  is  the  ith  nearest  x%  labeled  different  class  with  x 


3.2) 

3.3) 

3.4) 


Cl  •  •  •  •t'n  Cl]  >  [x?. 

=  Etr  MnM%  -  Eli  tinHl 


Then  Hn  =  [x. 

E/ mh 


m 


■  Xr, 


[V,  D]  =  EigenDecomposition(Yimh) ,  the  same  as  2.2. 
X  =  Dl/2VTX 


4)  Output:  X. 


Cl] 


2.5  Experimental  Results 

In  this  section,  we  conduct  experiments  on  pattern  classification  to  show  the  performance  of  our 
KLFE  algorithm  and  compare  it  with  existing  feature  extraction  schemes.  This  section  is  organized 
as  follows.  In  Section  2.5.1,  we  describe  the  experimental  setting.  In  Sections  2.5.2  and  2.5.3,  we 
show  the  experimental  results  for  simulated  data  sets  and  real-world  data  sets,  respectively. 

2.5.1  Experimental  Setting 

We  conduct  classification  experiments  on  two  types  of  data  sets,  namely,  simulated  data  sets  (sine- 
surface  and  Swiss  roll)  and  real-world  data  sets  (UCI  Machine  Learning  Repository  [12]  and  USPS 
digit  handwriting  data).  In  our  experiments,  we  use  two  data  sets  from  UCI  Machine  Learning 
Repository,  i.e.,  data  sets  for  diabetes,  and  ringnorm.  For  USPS  data,  we  choose  only  two  digits, 
namely  ”3  versus  5”,  since  they  are  the  most  challenging  digits  for  recognition.  (See  Fig.  2)  We 
exchange  the  ’’traditional”  training  sets  and  testing  sets  as  shown  in  Table  1. 

To  make  a  fair  comparison,  we  compare  KLFE  with  the  following  feature  extraction  schemes, 
which  are  also  based  on  kernel.  We  also  compare  KLFE  with  origin  LFE. 
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Figure  2:  USPS  handwritten  digits  3(top  row)  and  5(bottom  row). 


Table  1:  UCI  and  USPS  data  sets  used  in  the  experiments 


data  set 

training  sample  size 

testing  sample  size 

number  of  features 

UCI-diabetes 

468 

300 

8 

UCI-ringnorm 

400 

7000 

20 

USPS-(3vs.5) 

326 

1214 

256 

1.  Generalized  Discriminant  Analysis  (GDA)  using  a  kernel  approach  [13] 

2.  KPCA 

3.  FSKPCA,  which  is  one  type  of  algorithm  for  kernel  RELIEF 

4.  Kernel  K-Nearest  Neighbor  (KKNN). 

GDA  can  generate  at  most  n  —  1  features  where  n  is  the  number  of  categories/classes.  In 
this  paper,  we  only  study  feature  extraction  methods  for  binary  classification  problem.  KKNN  is 
kernelized  K- nearest-neighbor  (KNN)  [6,  page  174]  based  on  a  distance  function  induced  by  the 
kernel  function. 

Note  that  KLFE,  GDA,  KPCA  and  FSKPCA  are  feature  extraction  algorithms  and  we  are 
interested  in  their  classification  capability,  i.e.,  how  well  a  given  classifier  performs  if  the  classifier 
uses  the  features  obtained  from  these  feature  extraction  algorithms.  In  our  experiments,  we  choose 
KNN  as  the  classifier  because  of  two  reasons.  First,  KNN  is  a  simple  yet  effective  classifier,  which 
often  yields  competitive  results,  compared  to  some  advanced  machine  learning  algorithms  [14]. 
Second,  the  focus  of  this  paper  is  not  on  an  optimal  classifier  for  each  dataset.  KNN  is  surely  not  an 
optimal  classifier  in  many  cases  but  it  provides  a  platform  where  we  can  compare  different  feature 
extraction  algorithms  with  a  reasonable  computational  cost.  Actually,  for  fair  comparison,  we  let 
K  =  1.  Then  we  can  explicitly  see  how  these  feature  extraction  algorithms  improve  classification 
ability  of  KNN  in  kernel  space,  i.e.,  KKNN. 

In  the  experiments,  we  use  RBF  kernel  function  defined  by  Eq.  (67)  with  a  =  1.  Our  compari¬ 
son  strategy  is  to  use  the  same  kernel  function  with  the  same  width  a  and  the  same  classifier  KNN 
where  K  =  1.  To  eliminate  statistical  variations,  each  algorithm  is  run  several  times  for  each  data 
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set.  In  each  run,  a  data  set  is  split  into  training  data  subset  and  testing  data  subset  randomly.  Then 
the  testing  error  rate  is  obtained  by  averaging  over  all  the  runs. 

2.5.2  Experimental  Results  for  Simulated  Data 

In  this  section,  we  conduct  experiments  on  two  simulated  data  sets:  twin  sine  and  Swiss  roll,  which 
are  both  in  the  following  form: 


V  =  {(xn,  yn)}n=i , 


where 


yne  {-1,1}, 
xn  =  Rlx 3  X 


x(1) 


x(2) 


r(3) 


(69) 

(70) 

(71) 


where  Rjx 3  denotes  a  random  1x3  matrix. 

For  twin  sine  data,  x?  is  a  random  variable  uniformly  distributed  in  [0,27 r];  x[,2)  is  a  random 
variable  and  x[,2)  =  sin  ^xi1^  +  l(yn  =  1)  x  D  +  /3N,  where  J(-)  denotes  an  indicator  function, 

D  is  a  constant  and  (3n  denotes  a  Gaussian  random  variable  with  zero  mean  and  variance  a2;  x^ 
is  a  random  variable  uniformly  distributed  in  [0, 1].  Actually,  the  data  set  is  composed  of  two  3- 
dimensional  sine  surfaces,  labeled  as  —1  and  1,  with  additive  Gaussian  noise  Pn,  and  the  two  sine 
surfaces  are  separated  apart  by  a  distance  D.  Then  the  3-dimensional  vector  x}1'1.  x[,2).  x}}']7  is 
mapped  to  a  /-dimensional  vector  by  matrix  Rlx  3. 

Fig.  3(a)  shows  simulated  3-dimensional  sine-surfaces  and  the  data  set  of  [xl'\ x[2\ x[}]T. 
Fig.  3(b)  shows  the  projection  of  the  data  points  in  Fig.  3(a)  onto  x^  —  x{2)  plane. 

We  further  consider  the  relationship  among  classification  error  rate,  separation  distance  D,  and 
noise  variance  a2.  It  is  obvious  that,  as  D  increases,  the  two  sine-surfaces  are  further  away  from 
each  other,  resulting  in  better  classification  accuracy.  Similarly,  as  a2  decreases,  the  probability 
that  samples  from  two  classes  overlap  decreases,  which  also  increases  the  classification  accuracy. 
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x"  (#•  x") 

(a)  3D  View  (b)  Projection  of  the  data  points  in  Fig.  3(a)  onto 

-  ySn]  plane 

Figure  3:  Simulated  data  set  containing  sine  surfaces. 

Hence,  we  define  signal-noise-rate  (SNR)  as 

SNR  =  g  (72) 

SNR  (dB)  =  201og10(f).  (73) 

Figs.  5(a)  and  5(b)  show  classification  error  rate  vs.  target  feature  dimension  /'  for  different 
schemes  under  SNR  =  OdB  and  -5dB,  respectively. 

For  the  Swiss  roll  data,  we  let  xf,1  J  =  6  x  008(6*)  and  x[2*  =  6  x  sin(0),  where  6  is  a  random 

variable  uniformly  distributed  in  [0, 47r].  Actually  the  xl'  ’’ -xi2 i  curve  is  a  helix.  x[l3'1  is  a  random 
variable  uniformly  distributed  in  [0,  2];  then  the  data  set  is  a  3-D  helix  surfaces.  We  label  sam¬ 
ples  with  6  e  [0,  27t]  as  —1  and  those  with  6  e  (27t,47t]  as  1.  Then  the  3-dimensional  vector 

[xn  \  xi2  \  Xn^]T  is  mapped  to  a  /-dimensional  vector  by  matrix  /?  /  x  3 . 

Fig.  4(a)  shows  simulated  3-dimensional  Swiss  roll  and  the  data  set  of  [xi^ ,  x[,2,1 ,  x„!)  |7  .  Fig.  3(b) 
shows  the  projection  of  the  data  points  in  Fig.  3(a)  onto  x„  —  x„  plane. 


The  classification  error  rates  are  all  averaged  over  10  simulation  runs.  From  Figs.  5(a)  and 
5(b),  it  is  observed  that  KLFE+KNN  achieves  the  minimum  classification  error  rate  among  all  the 
schemes,  for  /'  ^  10;  KLFE+KNN  is  able  to  reduce  the  classification  error  rate  by  more  than  10%, 
compared  to  other  four  schemes.  From  Fig.  6,  we  can  see  that  KLFE  is  similar  with  LFE.  Both 
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(a)  3D  View 


(b)  Projection  of  the  data  points  in  Fig.  4(a)  onto 

x!,1 '  —  xj,2'  plane,  where  a  circle  represents  a  point 
with  label  —  1,  and  *  represents  a  point  with  label  1 


Figure  4:  Simulated  data  set  containing  Swiss  roll. 

KLFE  and  LFE  are  better  than  PCA.  When  dimension  equals  to  only  1,  KLFE  can  achieve  good 
results  while  the  other  two  perform  much  worse.  In  addition,  our  KLFE  is  quite  robust  against  the 
change  of  target  feature  dimension  V\  this  is  because  KLFE  has  an  explicit  mechanism  to  eliminate 
irrelevant  features. 


2.5.3  Experimental  Results  for  Real-World  Datasets 

In  this  section,  we  conduct  experiments  on  three  real-world  data  sets:  UCI-diabetes,  UCI-ringnorm, 
and  USPS-3vs5  (digit  “3”  vs.  “5”). 

Fig.  7(a)  shows  classification  error  rate  vs.  target  feature  dimension  I'  for  different  schemes 
on  diabetes  dataset.  It  is  observed  that  KLFE+KNN  achieves  the  minimum  classification  error  rate 
among  all  the  schemes,  for  I'  ^  30. 

Fig.  7(b)  shows  classification  error  rate  vs.  target  feature  dimension  I'  for  different  schemes 
on  ringnorm  dataset.  It  is  observed  that  KLFE+KNN  improves  the  classification  ability  of  KKNN 
dramatically,  by  reducing  the  classification  error  rate  by  more  than  50%.  GDA  also  yields  good 
performance  but  FSKPCA  and  KPCA  give  quite  poor  results,  which  are  even  worse  than  KKNN. 

Fig.  8  shows  classification  error  rate  vs.  target  feature  dimension  /'  for  three  different  schemes: 
KLFE+KNN,  LFE+KNN,  and  PCA+KNN.  In  this  experiment,  KLFE  performs  better  than  PCA, 
and  achieves  performance  similar  to  that  of  LFE. 
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(a)  TwinSine,  SNR  =  0  dB 


(b)  TwinSine,  SNR  =  -5  dB 


Figure  5:  Classification  error  rate  vs.  target  feature  dimension  of  simulated  data. 

In  summary,  our  proposed  KLFE  achieves  superior  performance  over  the  existing  algorithms 
in  most  cases.  Note  that  there  are  strong  relationship  among  KPCA,  KLFE,  and  FSKPCA.  Both 
KLFE  and  FSKPCA  find  a  basis  in  kernel  subspace,  differing  in  that  KLFE  uses  feature  extraction 
matrix  to  maximize  the  average  margin  whereas  FSKPCA  uses  feature  weighting  to  maximize  the 
average  margin.  Compared  to  KLFE  and  FSKPCA,  which  use  supervised  learning,  KPCA  uses 
unsupervised  learning  (i.e.,  without  using  label  information). 

It  is  worth  mentioning  that  our  experiments  focus  on  comparison  of  various  feature  extraction 
methods  rather  than  optimal  classifier  design.  In  fact,  in  order  to  achieve  best  classification  per¬ 
formance  using  KLFE+KNN,  we  should  select  the  optimal  K  for  KNN  under  KLFE,  and  the  best 
kernel  function.  But  for  fair  comparison,  we  just  use  the  same  classifier  and  the  same  parameter 
setting  for  all  the  feature  extraction  methods. 


2.6  Conclusion 

This  work  is  concerned  with  feature  extraction  techniques  for  pattern  classification  applications. 
A  good  feature  extraction  algorithm  is  critical  in  a  pattern  classification  system  as  it  helps  reduce 
system  complexity  and  enhance  classification  accuracy  by  eliminating  irrelevant  features. 

In  this  work,  we  proposed  a  novel  feature  extraction  algorithm,  referred  to  as  KLFE,  which  is 
a  generalization  of  LFE.  The  power  of  KLFE  lies  in  the  fact  that  KLFE  has  the  good  properties  of 
a  feature  extraction  technique,  i.e.,  it  is  a  nonlinear  wrapper  feature  extraction  method  that  solves 
a  convex  optimization  problem.  Although  nonlinearly  mapping  a  pattern  to  a  high-dimensional 
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classification  error  rate 


Figure  6:  Classification  error  rate  vs.  target  feature  dimension  on  Swiss  Roll 


(a)  UCI:  diabetes 


(b)  UCI:  ringnorm 


Figure  7:  Classification  error  rate  vs.  target  feature  dimension  of  UCI  data. 


Figure  8:  Classification  error  rate  vs.  target  feature  dimension  on  usps  3  vs.  5. 
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space  followed  by  LFE,  seems  to  incur  extremely  high  computation  complexity,  we  theoretically 
proved  that  LFE  and  KLFE  are  both  basis  rotation  invariant,  which  allows  us  to  implement  KLFE 
via  KPCA  or  KGP  followed  by  LFE. 

As  shown  in  Eq.  (68),  the  overall  computation  complexity  of  KLFE  using  RBF  kernel  function 
is  O  ( N2I )  +  O  (iV3),  comparable  to  LFE  in  original  feature  space.  In  other  words,  KLFE  has 
the  advantage  of  better  discriminant  information  extraction  in  high-dimensional  space,  while  pre¬ 
serving  a  comparable  computation  complexity  as  LFE  in  low-dimensional  space.  The  experiments 
conducted  on  both  simulated  data  set  and  three  real-world  data  sets  demonstrate  the  effectiveness 
and  robustness  of  our  KLFE  algorithm. 

3  Identification  of  Network  Dynamics  under  Sparsity  and  Sta- 
tionarity  Constraints 

The  sparse  vector  autoregressive  model  (VAR)  is  commonly  used  for  modeling  dynamic  networks, 
such  as  tank  movements,  troops  movements,  brain  functional  networks,  stock  markets  and  social 
networks.  A  penalized  linear  regression  was  proposed  to  identify  the  autoregressive  coefficient 
matrices  with  sparsity  constraint.  However,  though  the  VAR  model  is  assumed  to  be  stationary, 
this  property  is  never  taken  into  consideration  by  the  penalized  linear  regression.  Moreover,  the 
present  techniques  for  estimating  a  VAR  model  are  only  applicable  to  the  problems  with  relatively 
low  dimensionality  and  large  number  of  observations.  The  main  purpose  of  this  work  is  to  tackle 
these  challenging  issues.  We  formulate  the  problem  as  penalized  linear  regression  with  stationarity 
constraint,  and  propose  the  Berhu  iterative  sparsity  pursuit  with  stationarity  constraint  (BIPS)  to 
solve  the  problem  efficiently.  Berhu  is  a  novel  scheme  with  hybrid  penalty  that  improves  the  Lasso 
scheme  for  high  collinearity  problem.  We  also  implement  the  screening  technique  into  BIPS  for 
dealing  with  the  “large  p  small  n”  problem.  A  bootstrap  enhanced  learning  procedure  is  applied  to 
approximate  the  probability  of  existence  for  each  connection.  Experiments  show  that  our  method 
guarantees  a  stationary  estimate,  outperforms  Lasso  in  estimation  accuracy,  and  works  well  for 
high-dimensional  problems.  Next,  we  present  the  technical  details. 

3.1  Introduction 

There  is  recently  much  interest  in  identifying  the  network  dynamics  and  behaviors  in  the  emerging 
scientific  discipline  of  network  science  [15].  In  a  dynamic  network,  the  evolution  of  a  node  is  con- 
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trolled  not  only  by  itself,  but  also  by  other  nodes.  Take  the  gene  regulatory  network  for  example, 
the  expression  levels  of  genes  interact  with  each  other,  following  some  dynamic  rule  and  struc¬ 
ture.  The  interacting  relations  connect  the  genes  together,  which  forms  a  dynamic  network.  If  the 
topology  and  evolution  of  such  network  is  known,  we  can  analyze  the  regulation  between  genes,  or 
detect  unusual  behaviors  to  help  diagnose  and  cure  genetic  diseases.  Similarly,  the  modeling  and 
estimation  of  dynamic  networks  is  also  of  great  importance  for  various  domains  including  stock 
market,  brain  network  [16]  and  social  network  [17].  Therefore,  to  accurately  identify  the  topology 
and  dynamics  underlying  such  network,  scientists  are  devoted  to  find  appropriate  mathematical 
models  and  corresponding  estimation  methods. 

In  practice,  we  can  obtain  discrete  observations  of  the  network  over  a  period  of  time.  For 
example,  the  expression  levels  of  genes  are  collected  at  different  time  points  in  the  microarray 
experiment,  and  the  macroeconomic  data  of  U.S.  are  recorded  monthly  or  seasonally.  These  mul¬ 
tivariate  time  series  contains  important  information  for  network  estimation  or  analysis.  The  vector 
autoregressive  model  (VAR)  is  one  of  the  most  commonly  used  models  for  multivariate  time  se¬ 
ries  [18].  In  a  VAR  model,  the  state  of  each  node  in  the  network  is  characterized  by  a  time  series. 
The  value  of  a  node  at  the  current  time  point  is  a  linear  combination  of  the  past  values  of  itself  and 
other  nodes  that  regulate  it.  This  regulation  relationship  is  illustrated  by  the  model’s  coefficient 
matrices,  which  can  be  estimated  using  the  observed  data. 

When  estimating  the  coefficient  matrices,  we  not  only  want  the  estimate  to  fit  the  training 
data  well,  but  also  need  a  topology  that  is  easy  to  interpret  and  illustrates  the  most  important 
connections  in  the  network.  A  “sparse”  topology  is  not  only  easy  to  analyze,  but  also  agrees 
with  reality.  For  example,  it  has  been  believed  by  geneticists  that,  despite  the  large  scale  of  a  gene 
regulatory  network,  each  gene  is  normally  only  regulated  by  a  few  number  of  other  genes  [19].  And 
it  is  a  great  challenge  to  identify  such  regulation  relationships  accurately.  Compressive  sensing 
approaches  such  as  penalized  linear  regression  are  applied  by  researchers  to  achieve  a  small  fitting 
error  and  a  sparse  structure  simultaneously  [20].  Penalized  linear  regression  has  been  studied 
a  lot  in  the  past  decades.  Different  penalties  and  algorithms  are  proposed.  A  comprehensive 
review  is  provided  by  [21].  The  L  \  penalty  is  popular  for  its  elegance  in  theory  and  simplicity 
in  implementation.  However,  the  problem  for  applying  L  \  penalty  to  estimate  the  VAR  model 
is  its  incapability  of  dealing  with  collinearity.  Moreover,  it  also  suffers  from  inconsistency  and 
biasness.  To  improve  these  drawbacks,  we  study  a  new  penalty  Berhu,  which  combines  the  L  \  and 
L‘2  penalties  in  a  novel  fashion.  An  iterative  thresholding  algorithm  is  proposed  to  efficiently  solve 
penalized  linear  regression  with  Berhu  penalty. 
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For  a  network  to  stay  stable  and  function  normally,  the  VAR  model  must  be  stationary.  Imagine 
that  the  macroeconomic  data  is  not  a  stationary  process,  it  is  easy  for  one  or  more  indices  to  grow 
exponentially,  and  the  whole  economic  environment  must  collapse  eventually.  Mathematically,  the 
maximum  likelihood  estimation  and  penalized  linear  regression  all  depend  on  the  assumption  that 
the  VAR  process  is  stationary  [18].  However,  stationarity  has  never  been  taken  into  consideration 
by  the  estimation  algorithms.  In  fact,  as  will  be  shown  later  in  the  experiment,  it  is  possible  for 
penalized  linear  regression  to  give  a  nonstationary  estimate  even  when  the  true  model  is  stationary. 
In  this  paper,  we  aim  at  dealing  with  the  stationarity  property  of  VAR  model,  which  is  of  great 
importance  but  has  never  been  tackled  properly  before.  We  will  show  that  the  penalized  linear 
regression  method  may  fail  to  give  a  stationary  estimate,  and  propose  the  Berhu  iterative  sparsity 
pursuit  with  stationarity  constraint  (BIPS)  to  overcome  this  shortcoming.  Experiment  shows  that 
our  method  can  guarantee  a  stationary  and  sparse  estimate  as  well  as  give  a  satisfactory  identifica¬ 
tion  accuracy.  Moreover,  when  used  for  forecasting  or  prediction,  our  algorithm  outperforms  the 
simple  penalized  regression  in  a  fundamentally  different  fashion. 

Nowadays,  we  are  facing  more  and  more  challenges  from  high  dimensional  data  [22].  For 
example,  A  microarray  dataset  often  has  thousands  of  genes  but  fewer  than  one  hundred  of  samples. 
This  so-called  “large  p  small  n”  problem  adds  difficulties  to  our  estimation  method  in  terms  of 
accuracy  as  well  as  computation  cost.  Dimensionality  reduction  techniques  such  as  screening  [23] 
can  help  ease  the  computation  burden  and  improve  accuracy.  Nevertheless,  a  “one-time”  estimate, 
without  p-value  or  confidence  interval,  may  not  be  satisfactory  in  practice.  To  address  this  issue, 
we  propose  the  bootstrap  enhanced  learning  procedure.  Instead  of  selecting  connections,  bootstrap 
provides  us  the  probability  for  each  connection  to  exist.  We  will  use  the  stationary  bootstrap  [24], 
which  keeps  the  stationarity  property  of  each  bootstrap  sample. 

This  section  is  organized  as  follows.  Section  3.2  introduces  the  stationary  and  sparse  VAR 
model,  and  formulates  an  optimization  problem  to  estimate  such  a  model.  Section  3.3  proposes  our 
algorithms,  mainly  the  Berhu  iterative  sparsity  pursuit  with  stationarity  constraint  (BIPS),  and  the 
thresholding-based  iterative  screening  (TIS).  The  bootstrap  enhanced  BIPS  (BE-BIPS)  is  described 
in  Section  3.4.  Section  3.5  shows  experimental  results  on  synthetic  data.  In  Section  3.6,  we  apply 
the  proposed  framework  to  the  U.S.  macroeconomic  data.  Section  3.7  concludes  our  work. 
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3.2  The  VAR  Model  and  Problem  Formulation 


In  literature,  vector  autoregressive  (VAR)  model  is  commonly  used  for  modeling  networks  whose 
dynamics  are  described  by  multivariate  time  series  [18]: 


xt  =  ^2 A{k)xt-k  +  et. 

k=  1 


In  this  model,  a?  is  a  p-dimensional  vector  with  each  component  being  a  time  series  observed 
from  one  node  in  the  network,  where  p  is  the  number  of  nodes  in  the  network.  And  et  is  p- 
dimensional  multivariate  Gaussian  noise:  et  ~  A/"(0,  Se).  The  nodes  are  connected  with  each 
other  through  the  autoregressive  coefficient  matrices  A^’ s,  and  their  dynamics  are  influenced  by 
each  other  through  A^' s.  The  VAR  model  will  be  generally  more  powerful  in  modeling  complex 
dynamic  networks  with  a  larger  order  m.  Nevertheless,  for  most  of  practical  purposes,  it  is  suffi¬ 
cient  to  use  a  first-order  VAR  model  to  approximate  the  network  behaviors.  Hence,  we  will  mainly 
consider  the  first-order  VAR  model: 


xt  =  Axt_i  +  et,  (74) 

where  A  =  A ^  is  the  autoregressive  coefficient  matrix  for  time  lag  one.  The  coefficient  matrix 
A  =  describes  a  weighted  digraph  that  represents  the  dynamic  network:  there  is  a 

connection  link  from  node  j  to  node  i  with  weight  al3 .  We  call  this  graph  the  connection  graph 
of  the  network.  In  econometrics  and  bioinformatics,  this  kind  of  connection  is  called  Granger 
causality  [25],  since  it  illustrates  the  causal  relationship  between  two  nodes. 

3.2.1  Conditional  Maximum  Likelihood  Estimation  of  VAR 

Given  n  observations  of  the  dynamic  network  X\  {x‘,}”=] ,  we  wish  to  infer  the  connection  graph, 
i.e.,  to  estimate  A.  Assuming  a  stationary  process,  we  can  write  the  likelihood  function  of  A  as 

L(A\xi,  ■■■,*„)  =  f(x i,  •  •  •  ,xn\ A) 

n 

=  ,x2,xllA)f(x1\A ) 

t= 2 
n 

=  Y[f{xt\xt-1,A)f{xi\A). 

t= 2 
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The  last  but  one  equation  comes  from  the  chain  rule  of  conditional  probability.  The  last  equa¬ 
tion  comes  from  the  Markov-chain  property.  For  simplicity,  we  consider  the  conditional  likelihood 
function  where  X\  is  assumed  to  be  not  random: 


lm) = n  f{xt\xt-i,A) 

t=  2 
n 

=  JJ(27r)“p/-|Se| ~1/2exp{--(xt  -  Axt_i)TTjJ1(xt  -  Axt_ i)}. 

t=  2 


We  further  assume  independent  multivariate  Gaussian  noise  with  the  same  variance,  which 
means  =  al.  Then  the  conditional  maximum  likelihood  estimation  (CMLE)  can  be  obtained 
by  solving 


A 


1  n 

cmle  =  arg  min  -  ^ 

Z  t= 2 


\xt  -  Axt_ 


1 II 2 


=  argmjn  ^\\Y  -  AX \\2F  (=  /(A)), 


(75) 


where  Y  —  [x2,  X3,  ■  ■  ■  ,  xn],  and  A"  =  [a?i,  X2,  ■  ■  ■  ,  £cn_i].  We  can  see  that  the  conditional  max¬ 
imum  likelihood  estimate  of  a  VAR  model  is  actually  the  same  with  its  ordinary  linear  regression 
estimate.  On  the  other  hand,  the  exact  maximum  likelihood  estimate  requires  solving  for  a  nonlin¬ 
ear  optimization  problem,  which  is  computationally  expensive  and  will  not  be  discussed  in  detail 
in  this  paper. 


3.2.2  Penalized  Estimation  and  the  Berhu  Penalty 


The  conditional  maximum  likelihood  estimate  Acmle  is  a  dense  matrix,  which  corresponds  to  a 
complete  or  nearly  complete  connection  graph.  Such  a  graph  is  difficult  to  interpret,  and  usually 
does  not  agree  with  the  reality.  As  aforementioned,  a  sparse  connection  graph  is  expected  for  many 
scenarios.  In  other  words,  we  are  expecting  a  sparse  estimate  of  A. 

The  most  direct  idea  to  obtain  a  sparse  solution  is  to  add  a  constraint  on  ||  A||0  in  (75)  and  solve 


A  =  arg  min  f(A) 


s.t.  ||A||0^c, 


(76) 
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where  ||  A||0  is  the  L0  norm  of  A,  which  stands  for  the  number  of  nonzero  elements  in  A,  and  c  is  a 
positive  integer  that  satisfies  c  ^  p2.  It  is  easy  to  verify  that  this  integer  programming  problem  is 
NP-hard  [26].  Hence,  researchers  turn  to  its  Lagrange  form: 

A  =  arg  min  f  (A)  +  A||A||0. 

A 

However,  due  to  non-continuity,  this  problem  is  still  difficult  to  address.  Therefore,  researchers 
propose  to  use  other  penalty  functions  to  approximate  the  L0.  And  this  leads  to  the  general  form 
of  penalized  linear  regression  estimation  (PLRE): 

Aplre  =  arg  min  f(A)  +  P(A ;  A).  (77) 

A 


In  this  paper,  we  only  consider  the  additive  penalties.  And  for  convenience,  we  denote  P(A]  A)  = 
Y7i= i  Sj=i  P{aij'i  A,;/),  where  P(\)  is  a  penalty  function  applied  to  each  component  of  the  regres¬ 
sion  coefficient  a^,  and  \,:j  is  the  corresponding  regulation  parameter(s).  Such  a  penalty  function 
is  designed  to  enforce  a  certain  kind  of  sparsity  on  A.  Hence,  (77)  aims  at  obtaining  a  solution  that 
not  only  gives  a  small  regression  error  but  also  has  a  sparse  structure. 

Different  penalties  have  been  proposed  and  studied.  The  famous  Lasso  [27]  essentially  solves 
the  Li  penalty.  It  is  easy  to  solve  thanks  to  its  convexity.  However,  Lasso  suffers  from  several 
drawbacks,  such  as  inconsistency,  biasness  and  incapability  of  dealing  with  collinearity.  The  hard 
penalty  function  [28]  is  designed  to  mimic  the  L0  penalty.  It  does  not  introduce  bias,  but  suf¬ 
fers  instability  in  variable  selection  due  to  non-continuity.  Besides  penalty  functions  with  a  single 
thresholding  parameter,  researchers  also  propose  a  variety  of  hybrid  penalties  with  multiple  regu¬ 
lation  parameters.  Examples  include  the  elastic  net  [29],  the  SCAD  [30]  and  the  hybrid  TISP  [31]. 


In  this  paper,  we  are  going  to  adopt  a  new  hybrid  penalty  Berhu.  Berhu  gains  its  name  from 
Huber  [32],  which  is  a  famous  function  for  robust  regression.  The  Huber  function  (78)  is  quadratic 
in  the  small  variables  and  linear  in  the  larger  ones.  When  used  as  a  criterion  function  for  regression, 
the  Huber  function  is  only  linearly  increased  by  the  large  errors,  which  makes  it  more  robust 
to  outliers  than  the  squared-error  criterion.  Inspired  by  Huber  function,  [33]  designed  a  penalty 
function  called  “Berhu”  (79).  As  implied  by  its  name,  Berhu  is  a  reversed  version  of  Huber:  it 
is  linear  in  small  variables  and  quadratic  in  larger  ones.  Figure  9  shows  an  example  of  the  two 
functions. 


h  m  =  \02  m^M 

[2M\0\-M2  if  \6\  >  M 


(78) 
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(79) 


Figure  9:  Example  of  Huber  and  Berhu. 

The  Berhu  penalty  seems  quite  similar  to  the  elastic  net,  which  also  uses  both  L\  and  L2  penal¬ 
ties.  However,  since  the  elastic  net  only  simply  enforces  both  the  L  \  and  L2  penalties  simultane¬ 
ously  despite  the  value  of  the  variable,  the  singularity  of  the  penalty  function  at  zero  is  mitigated  to 
some  extent  by  the  L2  part,  which  may  lead  to  an  estimate  that  is  not  parsimonious  enough.  Berhu 
overcomes  this  drawback  by  separating  the  L\  and  L2  terms  based  on  the  value  of  the  variable. 
It  puts  Li  penalty  on  the  smaller  elements,  which  guarantees  sparsity,  and  puts  L2  penalty  on  the 
larger  elements,  which  deals  with  the  collinearity  in  a  similar  way  with  ridge  regression  [34].  Thus, 
Berhu  not  only  preserves  the  singularity  property  of  Lasso  at  zero  but  also  has  the  advantages  of 
ridge  regression  in  dealing  with  high  collinearity  and  high  noise  level.  Compared  with  SCAD  and 
hybrid  TISP,  Berhu  enjoys  favorable  properties  of  convexity.  For  example,  path-wise  warm  start 
can  be  used  when  tuning  regularization  parameters.  This  can  save  a  lot  of  computation,  which 
makes  Berhu  more  suitable  than  nonconvex  penalties  for  large-scale  problems. 

3.2.3  Sparse  and  Stationary  Estimation  of  the  VAR  Model 

Stationarity  is  an  important  property  for  VAR  model.  Stationary  and  nonstationary  VAR  processes 
behave  fundamentally  in  different  fashions,  which  can  be  seen  from  Figure  10.  The  probability 
distribution  of  observations  from  a  stationary  VAR  process  is  invariant  with  respect  to  the  shift 
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in  time.  That  is,  f(xtl,xt2,  •  •  ■  ,  xtn)  =  f{xtl+t,  xt2+h  ■  •  •  ,  xtn+l)  for  arbitrary  ,  tn,  all 

n  and  l  =  0,  ±1,  ±2,  •  •  • .  While  in  the  nonstationary  process,  we  can  see  clearly  drifting  and 
trending  behaviors.  In  practice,  we  often  come  across  stationary  VAR  processes.  Even  when  the 
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Figure  10:  Example  of  stationary  and  nonstationary  VAR  processes.  The  number  of  nodes  is 
p  =  50.  The  stationary  VAR  process  has  p(A)  =  0.95,  and  the  nonstationary  VAR  process  has 
p(A)  =  1.05. 

original  process  is  nonstationary,  we  can  reduce  it  to  a  stationary  one  through  some  preprocesses, 
such  as  differencing  the  time  series  and  removing  the  common  trend.  Therefore,  the  input  to 
an  estimation  algorithm  is  usually  a  series  of  stationary  observations.  As  has  been  shown,  the 
conditional  maximum  likelihood  estimate  and  the  penalized  linear  regression  estimate  are  all  based 
on  the  assumption  that  the  VAR  process  is  stationary.  So  it  is  natural  to  require  the  estimate  to  also 
be  stationary.  However,  as  will  be  shown  in  the  experiment,  due  to  estimation  bias  and  error,  the 
penalized  linear  regression  estimate  APLRE  may  violate  the  stationarity  condition.  Hence,  we  need 
to  take  stationarity  into  consideration  and  reformulate  the  estimation  problem. 
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For  the  stationarity  property  to  hold  for  the  VAR  model  (74),  The  spectral  radius  of  A  should 
satisfy  the  stationarity  condition : 


p(A)  =  max  \oti\  <  1,  (80) 

i 

where  a*  is  the  /th  eigenvalue  of  A  [18].  Therefore,  we  add  the  stationarity  condition  to  the 
penalized  linear  regression  estimation  for  VAR  model: 

A  =  arg  min  h\Y  -  AX\\2F  +  PB(A-  A,  M)  (=  1(A)) 

A  2  (81) 
s.t.  p(A)  <  1. 

In  general,  the  penalty  function  in  (81)  can  be  any  penalties  we  discussed  before.  We  focus 
on  the  Berhu  penalty  PB( •)  just  in  favor  of  its  advantages  in  estimating  VAR  model.  The  most 
challenging  part  of  this  problem  is  the  constraint  on  the  spectral  radius  p(A),  since  p(A)  is  a 
nonconvex  and  non-Lipschitz-continuous  function  of  a  generally  nonsymmetric  matrix  A  [35]. 
Hence,  we  consider  a  convex  relaxation  of  (80): 


(82) 


where  1 1  /1 1 1 2  is  the  spectral  norm  (induced  2-norm)  of  A.  Then  problem  (81)  can  be  reformulated 
as 


A  =  arg  min  1(A) 
s.t.  \\A\\2  ^  1. 


(83) 


Throughout  this  paper,  we  will  call  (82)  the  stationarity  constraint,  and  work  on  solving  the 
above  problem. 


3.2.4  Equivalent  Formulations 

Problem  (83)  is  a  nonsmooth  constrained  convex  optimization  problem,  which  can  be  reformulated 
and  then  solved  by  some  well-known  optimization  algorithms,  such  as  semidefinite  programming, 
projected  subgradient  method  and  the  alternating  direction  method  of  multipliers.  Before  propos¬ 
ing  BIPS,  we  first  discuss  briefly  about  these  methods. 
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Semidefinite  Programming  It  can  be  proved  that 


^  1  ^ 


I  A 
AT  I 


y  o. 


Hence,  problem  (83)  can  be  reformulated  as  a  semidefinite  programming(SDP)  problem: 

A  =  argmin  1(A) 


s.t. 


I  A 
AT  I 


y  o. 


However,  since  most  of  the  general  SDP  solvers  use  interior  point  methods,  they  suffer  from 
very  high  time  complexity  and  space  complexity,  especially  for  large-scale  problems  [36].  We  tried 
popular  SDP  solvers  SeDuMi  [37]  and  SDPT3  [38]  for  problem  (84)  using  MATLAB7.11.0  on  a 
PC  with  4GB  memory.  When  the  network  size  p  =  100,  the  program  would  run  out  of  memory. 

Projected  Subgradient  Method  The  projected  subgradient  method  (PSGM)  is  proposed  to 
solve  constrained  convex  optimization  with  nonsmooth  objective  functions  [39].  Define  the  sub¬ 
gradient  of  the  objective  function  1(A)  at  A  as 

del(A)  =  Xf(A)  +  d£PB(A-,\,M), 

where  d£PB(A\  A,  M)  is  the  e- subdifferential  of  PB( •)  at  A,  which  is  defined  by  [39].  And  V  f(A) 
is  the  gradient  of  f(A )  at  A:  V  f(A)  =  (AX  —  Y)X  r.  Unlike  gradient,  which  is  an  exact  value  for 
a  given  point,  the  subgradient  consists  of  a  set  of  values.  Any  element  in  this  set  is  a  valid  choice 
for  the  subgradient. 

The  PSGM  for  problem  (83)  can  be  described  as  follows.  At  each  point,  we  choose  a  direction 
Uk  from  the  set  of  the  subgradient  del(A),  take  a  step  along  the  opposite  direction  with  a  step  size 
ak,  and  arrive  at  a  tentative  point.  Then  we  project  this  tentative  point  to  the  feasible  convex  set. 
We  continue  like  this  until  reaching  a  point  where  0  is  in  the  subgradient  set.  The  step  size  a k 
should  satisfy  oik  =  oo  and  o?k  <  oo. 

For  problem  (83),  the  feasible  convex  set  is  given  by  the  stationarity  constraint: 

C  =  {A:  ||A||2  ^  1}.  (84) 

The  projection  nc(A)  can  be  done  easily.  We  take  the  singular  value  decomposition  (SVD)  of 
A,  truncate  the  singular  values  that  are  larger  than  1  to  1,  and  then  transfer  back  to  get  IIC(A) 
(Appendix  A).  This  operation  is  called  SVD  projection. 
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PSGM  is  simple  to  implement.  However,  due  to  the  uncertainness  of  the  subgradient  at  the 
non-differentiable  point  of  the  penalty  function,  it  suffers  from  slow  convergence  and  insufficiency 
of  sparsity. 

Alternating  Direction  Method  of  Multipliers  The  alternating  direction  method  of  multipliers 
(ADMM)  is  a  method  based  on  the  dual  ascent  method  and  the  augmented  lagrangian  method  [40] . 
The  basic  idea  is  to  split  the  objective  function  and  variables  into  two  parts,  and  update  them  in  an 
alternating  fashion.  To  apply  ADMM  for  solving  problem  (83),  we  reformulate  the  problem  as 

min  gx  (A)  +  g2  {B) 
s.t.  A  —  B  =  0 
A  e  C, 


where  g, (A)  =  \\\Y  —  AX\\2F,  and  g2(B)  —  Pb(B;  X,M). 

The  augmented  lagrangian  can  be  written  as 

LP(A,  B ,  T)  =  gi(A)  +  g2(B)  +  T  o  (A  -  B)  +  |||A  -  B\\2F,  (85) 

where  “o”  denotes  the  Hadamard  product,  T  is  the  lagrangian  multipliers  and  p  is  the  augmented 
Lagrangian  parameter. 

The  iteration  of  ADMM  for  solving  (85)  consists  of  the  following  steps 

Ak+1  =  argmin  LP(A,  Bk ,  Tk),  (86a) 

Bk+1  =  argmin Lp(Ak+1,  B ,  Tk),  (86b) 

B 

rk+ 1  =  rk  +  p(Ak+1  -  Bk+1).  (86c) 

The  /l-minimization  step  (86a)  can  be  solved  by  projected  gradient  method,  which  requires 
inner  iterations  at  each  step.  The  5-minimization  step  (86b)  is  solved  by  subdifferential  calculus. 
According  to  theoretical  analysis  and  experiment,  ADMM  also  suffers  from  slow  convergence  and 
huge  computation  if  high  accuracy  is  expected. 

Having  discussed  the  equivalent  formulations  and  solutions  for  problem  (83),  we  find  a  more 
efficient  algorithm  is  in  great  need.  Therefore,  we  propose  a  novel  algorithm,  the  Berhu  iterative 
sparsity  pursuit  with  stationarity  constraint  (BIPS)  in  the  following  section. 
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3.3  The  DIPS  Framework 


To  solve  problem  (83)  efficiently,  we  first  introduce  the  thresholding  rule  and  iterative  solution  for 
Berhu,  and  then  propose  the  BIPS  framework. 

3.3.1  The  Thresholding  Rule  for  Berhu 

Following  a  three-step  procedure  [31],  we  can  construct  the  corresponding  thresholding  rule  of  the 
Berhu  penalty  (79): 


It  can  be  seen  clearly  that  Li  penalty  is  put  on  the  variables  which  are  less  than  X  +  M  (the  first 
two  cases),  and  L2  penalty  is  put  on  the  variables  which  are  larger  than  X  +  M  (the  third  case).  We 
let  r)  =  A/M  and  rewrite  TB(»)  as 


This  form  of  TB(»)  shows  that  the  Berhu  penalty  does  simultaneous  selection  and  shrinkage 
controlled  by  a  thresholding  parameter  A  for  the  L\  penalty  and  a  ridge  parameter  r/  for  the  L2 
penalty.  The  threshold  for  choosing  L\  or  L2  is  determined  jointly  by  A  and  //.  Thanks  to  the  smart 
combination  of  the  Li  and  L2  penalties,  Berhu  not  only  does  selection  in  a  similar  fashion  with 
Lasso,  but  also  has  the  ability  to  deal  with  collinearity  and  high  noise  level.  Figure  1 1  shows  the 
penalty  function  and  thresholding  rule  of  Berhu,  in  contrast  with  Lasso  ( Li )  and  Ridge  (L2)  penalty. 
Note  that  the  soft  thresholding  corresponds  to  Lasso.  And  the  ridge  thresholding  is  actually  a 
“shrinking”  operation. 

Having  constructed  the  thresholding  rule  for  Berhu,  we  can  now  use  iterative  thresholding 
procedure  [41]  to  solve  penalized  linear  regression  (77)  with  Berhu  penalty,  which  we  call  the 
Berhu  sparsity  pursuit : 


Aserhu  =  arg  min  f(A)  +  PB(A ;  A,  M ). 
A 


(87) 
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Berhu  penalty 


Lasso  penalty 


Ridge  penalty 


Figure  11:  Penalty  functions  and  corresponding  thresholding  rules.  A  =  0.2,  M  =  1.3. 
Starting  from  an  initial  estimate,  we  update  the  estimate  and  apply  the  thresholding  rule: 

Ak+l  =  TB(Ak  +  XYt  -  XXTAk-  A,  ?y).  (88) 


This  procedure  goes  on  iteratively  until  convergence.  Throughout  the  iteration,  some  of  the 
entries  of  the  estimate  will  be  shrunk  to  exactly  zero.  And  the  thresholding  procedure  acts  like  it  is 
“selecting”  the  variables  that  should  have  nonzero  coefficients.  Therefore,  variable  selection  and 
estimation  are  achieved  simultaneously. 

It  can  be  proved  that,  for  an  arbitrary  matrix  X,  given  it  is  properly  preliminarily  scaled  (A"  <— 
X/ko),  the  iterative  procedure  can  reach  the  global  minimum  of  the  penalized  objective  function 

(Appendix  B). 

It  is  required  that  k0  ^  ||X||2/a/2  for  convergence  guarantee.  On  the  other  hand,  experience 
indicates  that  smaller  k0  leads  to  faster  convergence.  Therefore,  we  choose  ko  =  1 1  A"  1 1 2 / \/2  in 
practice.  Moreover,  a  relaxation  form  of  the  iterative  procedure  can  be  used  to  further  speed  up 
the  convergence,  as  given  in  (89).  According  to  our  experience,  the  number  of  iterations  can  be 
reduced  by  about  40%  compared  to  the  original  form  if  we  choose  u  —  2. 


^k+l  =  ^  +  ^k  +  xyT  _  xxTAk ); 

Ak+1  =  TB{sYk+1-  A) 


(89) 
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Unlike  PSGM  or  ADMM,  where  sparsity  and  computation  speed  are  hurt  by  the  uncertainness 
of  subdifferentials,  the  iterative  thresholding  procedure  keeps  selecting  variables  in  a  determinate 
fashion,  which  contributes  to  fast  convergence  and  sufficient  sparsity. 

3.3.2  The  BIPS  Algorithm 

Based  on  the  iterative  thresholding  procedure,  we  now  introduce  the  BIPS  algorithm,  where  sta- 
tionarity  is  addressed.  As  described  by  Algorithm  1,  the  BIPS  algorithm  consists  of  two  stages. 
The  first  stage  solves  the  Berhu  sparsity  pursuit  (87)  and  gives  an  estimate  A.  We  check  if  the 
stationarity  condition  (80)  is  satisfied  by  A.  If  it  is  satisfied,  we  accept  and  output  this  solution. 
Otherwise,  we  go  to  the  second  stage,  where  we  run  the  iterative  procedure  again,  with  a  stationar¬ 
ity  constraint  (82)  added  in  each  iteration.  That  is,  at  each  iteration,  we  project  the  updated  estimate 
Ak+1  onto  the  convex  set  (84).  Note  that  X  is  normalized  and  preliminarily  scaled  before  running 
BIPS,  so  we  need  to  scale  A  accordingly  before  and  after  the  SVD  projection.  This  is  done  by 
operations  scale  and  scale  Back  in  Algorithm  1. 

In  the  BIPS  algorithm,  both  the  stationarity  condition  (80)  and  its  convex  relaxation,  the  sta¬ 
tionarity  constraint  (82),  is  used.  In  the  first  stage,  we  use  the  stationarity  condition  to  check  if 
the  estimate  given  by  Berhu  sparsity  pursuit  is  stationary;  In  the  second  stage,  we  consider  the 
stationarity  constraint  and  solve  (83),  which  guarantees  a  stationary  estimate.  The  reason  we  keep 
both  stages  is  that  (82)  is  a  sufficient  but  not  necessary  condition  for  (80).  Therefore,  it  may  put 
a  too  strong  constraint  on  the  solution.  If  we  only  run  the  first  stage,  stationarity  is  not  guaran¬ 
teed.  If  we  only  run  the  second  stage,  some  estimates  that  do  not  violate  the  stationarity  condition 
may  be  modified  by  the  stationarity  constraint,  which  leads  to  a  larger  estimation  error.  While 
by  combining  the  two  stages  together,  BIPS  not  only  guarantees  stationarity,  but  also  ensures  the 
identification  and  estimation  accuracy.  It  may  at  first  sight  seem  quite  time  consuming  to  run  two 
iterative  procedures.  However,  for  a  stationary  model,  the  probability  for  the  Berhu  sparsity  pur¬ 
suit  to  violate  the  stationarity  condition  is  actually  very  small.  For  most  of  the  cases,  the  algorithm 
does  not  need  to  run  the  second  stage.  Therefore,  the  average  running  time  of  BIPS  is  in  the  same 
order  with  the  Berhu  sparsity  pursuit. 
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Algorithm  1  The  Berhu  iterative  sparsity  pursuit  with  stationarity  constraint  (BIPS) 
Input:  X  =  [cci,  x2,  ■  ■  ■  ,  £E„_i],  normalized  rowwisely  and  preliminarily  scaled 

Y  =  [*2)  *3,  ■ ' '  )  *n]j  centered  rowwisely 
regularization  parameters:  A,  77 
relaxation  parameter:  a; 
stopping  criterions:  5,  M 

initialization  for  the  estimate:  A0 
{Definition  of  two  operations:  scale  and  scaleBack. 

scale :  rowwise  normalization  and  preliminary  scaling,  as  has  been  done  to  X. 

scaleBack :  inverse  operation  of  scale.} 

1. 

2.  First  run: 

3.  k  —  0;  £/°  =  A0; 

4  repeat 

5.  sYk+l  =  (1  —  iS)sYk  +  c u(Ak  +  XYT  —  XArAlfc);{update  with  relaxation} 

6.  Ak+l  =  Te(^/fc+1;  A,  77); {thresholding  rule  for  Berhu} 

7.  k  =  /c  +  1; 

s.  until  ||Afc+1  —  ^  5  or  /c  ^  M 

9.  A  =  sca/e-Bac/c(Afc+1); 

10. 

1 1 .  Check  stationarity: 

12.  if  p(A)  <  1  then 

13.  go  to  output;  {stationarity  constraint  satisfied,  no  need  for  second  run} 

14.  end  if 

15. 

16.  Second  run: 

17.  k  =  0;  sY{)  =  A 0  =  scale  (A); 

is.  repeat 

19.  srfk+l  =  (1  —  u j)sYk  +  o;(Afc  +  XYT  —  XA'rAfc);{update  with  relaxation} 

20.  Ak+1  =  Te(^/fc+1;  A,  77);  {thresholding  rule  for  Berhu} 

21.  Ak+1  =  scaleBack(Ak+1)\ 

22.  Afc+1  =  nc(Afc+1);  {SVD  projection} 

23.  Ak+1  =  sca/e(ALfc+1); 

24.  k  —  k  +  1; 

25.  until  ||Afc+1  —  Afc||^  ^  5  or  ^  M 

26.  A  =  scaleBack(Ak+1 ); 

27.  40 

Output:  A 


3.3.3  Thresholding-based  Iterative  Screening 


Nowadays,  a  great  challenge  for  network  inference  and  statistical  learning  comes  from  the  large 
scale  of  the  system,  such  as  world  wide  web  and  human  genome  program.  The  huge  dimensionality 
p  requires  the  algorithm  to  have  nice  scalability.  Moreover,  we  may  only  be  able  to  obtain  just  a 
short  snapshot  of  the  system,  either  because  the  measurement  is  too  expensive  or  time  consuming, 
or  simply  because  long  time  measurement  is  impossible.  Therefore,  the  “large  p  small  n”  problem 
has  become  a  hot  topic  [42]. 

When  p  n,  with  the  assumption  that  the  number  of  nonzero  elements  is  far  smaller  than 
n,  we  can  apply  screening  techniques  to  coarsely  select  the  variables  before  finer  estimation.  For 
example,  if  we  are  sure  that  the  number  of  connections  for  each  node  is  much  less  than  fin  (say 
/i  =  0.8),  we  can  first  use  screening  technique  to  select  fin  candidate  nodes,  and  then  apply  BIPS 
on  them  for  further  selection  and  estimation.  Since  BIPS  is  efficient  for  relatively  low  dimensional 
data,  we  save  a  lot  of  time  by  doing  screening  first,  as  long  as  the  screening  technique  is  suffi¬ 
ciently  fast  and  accurate.  Here  we  propose  the  thresholding-based  iterative  screening  (TIS)  for 
VAR  model,  given  in  Algorithm  2. 

As  implied  by  its  name,  TIS  depends  on  an  iterative  selecting  procedure.  However,  it  does 
not  select  variables  using  a  given  thresholding  parameter  A.  Instead,  TIS  keeps  a  fixed  number  s 
( s  =  fmp )  of  nonzero  elements  at  each  iteration.  To  be  specific,  at  the  (k  +  l)th  iteration,  we  find 
the  sth  largest  element  of  the  updated  estimate  Ak+1,  use  it  as  the  current  thresholding  parameter 
Afc+1,  and  apply  the  thresholding  function  onto  the  estimate:  Ak+l  =  TB(Ak+1\  Afe+1,  rj).  (Note 
that  though  we  use  Berhu  here,  the  algorithm  can  be  generalized  easily  to  other  penalties.)  In  this 
way,  after  each  iteration,  we  obtain  an  estimate  with  exactly  s  nonzero  elements,  which  is  the  s 
largest  in  the  updated  estimate.  Since  we  only  need  to  obtain  the  nonzero  entries  in  the  screening 
stage  but  care  little  about  the  exact  value  of  each  element,  we  can  choose  a  relatively  large  £  and 
small  M.  Also,  we  can  use  an  empirical  value  for  rj,  instead  of  running  the  algorithm  with  different 
values  of  q  and  choosing  the  optimal  one.  The  algorithm’s  output  S  =  {sij}i^i,j<^P  is  a  pxp  matrix 
recording  the  entries  in  A  that  are  selected  by  screening.  That  is,  we  have 


(90) 


The  sure  independent  screening  (SIS)  technique  [23]  for  dimensionality  reduction  is  simple  and 
fast,  but  it  relies  on  the  assumption  that  the  predictors  are  independent.  Since  it  only  calculates 
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Algorithm  2  Thresholding-based  iterative  screening  (TIS) 

Input:  X  =  [*1,  x2.  •  ■  ■  ,  ccn_i],  normalized  rowwisely  and  preliminarily  scaled 

Y  =  [x2,x3,  -  ■  ■  ,  a:n],  centered  rowwisely 
regularization  parameters:  A,  rj 
relaxation  parameter:  u 
stopping  criterions:  S,  M 

initialization  for  the  estimate:  A0 
expected  number  of  nonzeros:  s 


l. 


2.  k  —  0;  sYG  =  A0; 

3.  repeat 

4.  sYk+l  =  (1  —  uj)sYk  +  uj(Ak  +  XTt  —  XXTAfc);{update  with  relaxation} 


5.  Afc+1  =  sth  largest  element  of  -«fk+ 1 ;  (determine  the  elements  to  keep} 


6.  Ak+l  =  Tfj(.(Yk+]:  A,  r/);  (force  other  elements  to  be  zero} 

7.  k  —  k  +  1; 

8.  until  ||Afc+1  —  ^  5  or  k  ^  M 

9.  A  =  Ak+1 ; 

10. 

li.  let  5  be  a  p  x  p  matrix,  and 


12. 


Output:  S 
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the  marginal  correlation  between  Y  and  X,  and  chooses  the  variables  accordingly.  On  the  other 
hand,  TIS  is  more  powerful  than  SIS  in  dealing  with  collinearity.  This  can  be  seen  from  the  update 


step  of  the  algorithm,  where  the  information  of  the  correlation  matrix  XXT  is  introduced  into  the 


calculation.  Although  TIS  sacrifices  some  computation  time  due  to  the  iteration,  the  great  gain  in 
selection  accuracy  is  still  worthwhile. 

After  screening,  we  can  apply  BIPS  on  the  reduced  model.  To  achieve  this,  we  just  need  to 
add  one  step  in  the  iteration  of  BIPS,  which  confines  the  updated  estimate  to  the  reduced  model. 
That  is,  after  the  updating  step  (line  5  and  line  19  in  Algorithm  1),  we  only  keep  the  entries  that 
are  selected  in  the  screening  stage  by  letting 


3.3.4  Tuning  Strategy 

The  are  two  regularization  parameters  A  and  rj  in  BIPS  that  need  tuning.  Since  the  estimate  is 
not  quite  sensitive  to  the  ridge  parameter  rj,  it  is  not  necessary  to  run  a  full  two-dimensional  grid 
search  to  look  for  the  best  parameters.  Instead,  we  search  along  a  couple  of  one-dimensional 
solution  paths  including  the  A-paths  (with  rj  fixed)  and  the  //-paths  (with  A  fixed).  Based  on  our 
experience,  the  following  “1-3-1”  strategy  works  well: 

Step  1 :  run  the  first  ridge  path  (A  =  0).  Do  ridge  regression  with  different  values  for  the  ridge 
parameter  rj,  and  get  the  optimal  ridge  parameter  rj*  as  a  reference  of  the  A-paths. 

Step  2:  run  3  A-paths  with  rj  =  0.5 rj*,  0.057/*,  0,  0057/*  respectively.  For  each  value  of  rj ,  run 
BIPS  with  a  number  of  values  for  A,  and  find  the  optimal  A*.  Then  we  will  have  three  A*’s,  one 
from  each  path.  Choose  the  one  that  gives  the  smallest  prediction  error  and  let  it  be  the  optimal 
thresholding  parameter  A**. 

Step  3:  run  the  final  //-path  with  A**.  Choose  the  optimal  ridge  parameter  rj**  along  the  path. 
The  (A**,  rj**)  is  our  final  choice  of  the  two  parameters. 

A  proper  criterion  is  needed  for  choosing  the  optimal  parameter  at  each  step.  As  is  well  known, 
if  we  simply  use  the  squared  fitting  error  to  be  the  criterion,  it  easily  leads  to  overfitting,  especially 
when  the  number  of  observations  is  limited.  Moreover,  on  the  path  there  are  sparse  solutions,  so 
we  need  a  criterion  that  considers  different  sparsity  patterns.  For  Step  1,  we  use  AIC  to  choose  the 
optimal  ridge  parameter  [43].  For  Step  2  and  Step  3,  we  adopt  the  K-fold  selective  cross-validation 
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(SCV)  score  as  the  tuning  criterion  [31].  Here  we  denote  the  parameter  to  be  tuned  as  u>  (it  can  be 
the  thresholding  parameter  A  or  the  ridge  parameter  rj).  First,  we  apply  the  BIPS  algorithm  to  the 
whole  dataset  along  an  cu-path,  obtaining  the  solutions  A(u;)’s  and  the  associated  sparsity  patterns 
nzu  =  nz(A(u)).  Second,  we  apply  A'-fold  SCV  to  A{uS)  for  each  c u:  for  each  fold,  we  run  a 
simple  ridge  regression  on  the  training  data  with  only  the  predictors  picked  by  nzu,  and  record  the 
prediction  error  from  the  testing  data.  The  averaged  prediction  error  of  the  K  folds  is  defined  to 
be  the  SCV  score  for  the  corresponding  value  of  u>.  Finally,  we  determine  the  optimal  value  cuopt 
as  the  one  that  gives  the  smallest  SCV  score  along  the  path. 

3.4  Bootstrap  Enhanced  Learning 

The  “TIS+BIPS”  framework  proposed  in  Section  3  is  a  powerful  tool  for  sparse  and  stationary 
estimation  of  VAR  model  in  the  “large  p  small  n”  scenario.  However,  a  “one-time”  estimate, 
without  p-value  or  confidence  interval,  may  not  be  trustworthy  in  practice.  Bootstrap  [44]  is  a 
powerful  nonparametric  tool  for  approximating  the  distributions  of  statistics,  confidence  intervals, 
or  rejection  probabilities  of  tests.  It  resamples  the  data  and  recalculates  the  statistics  using  the 
resampled  data.  From  the  recalculated  statistics,  we  can  estimate  the  distributions  of  interest  and 
construct  confidence  intervals.  Hence,  we  propose  the  bootstrap  enhanced  BIPS  (BE-BIPS),  which 
provides  a  measure  of  the  confidence  about  weather  a  connection  exists  in  the  VAR  model. 

3.4.1  The  BE-BIPS  Framework 

Assume  that  the  connections  between  two  arbitrary  nodes  in  the  network  follows  a  distribution  that 

{=  0,  with  probability  1  —  £tj 
7^  0,  with  probability  • 

By  bootstrapping,  we  can  approximate  this  distribution  and  obtain  the  empirical  value  for  The 
BE-BIPS  framework  is  described  as  follows: 

Step  1 :  run  BIPS  over  the  original  dataset  X.  Record  the  pattern  of  A,  which  is  a  p  x  p  matrix 
P,  defined  in  the  same  way  as  S  in  (90). 

Step  2:  Draw  a  bootstrap  sample  X*  from  the  original  data.  Repeat  Step  1  for  X*. 

Step  3:  Repeat  Step  2  for  B  times.  And  record  the  pattern  P*  for  the  jth  bootstrap  sample. 
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Adding  up  all  the  patterns  P*' s  and  normalizing  it  by  B,  we  define  the  matrix  E  =  {eij}i^i,j^P 
of  connection  existence  probability: 


Given  a  sufficiently  large  B,  the  connection  existence  probability  et]  is  a  good  approximation 
of  which  can  serve  as  a  measure  of  how  confident  we  are  about  the  existence  of  each  possible 
connection  in  the  network.  For  example,  if  =  82%,  it  means  that  in  82%  of  the  bootstrap 
samples,  a  connection  is  identified  from  node  j  to  node  i.  So  we  can  say  the  probability  for 
the  existence  of  this  connection  is  approximately  82%.  Given  the  probability  matrix  E,  we  can 
enforce  a  threshold  e*  on  the  connection  existence  probability,  and  choose  only  the  connections 
with  eij  ^  e*  for  further  study. 

The  outcome  of  BIPS  can  be  viewed  as  a  sparse  weighted  digraph,  with  the  weight  of  an  edge 
being  the  regulation  strength.  On  contrast,  the  outcome  of  the  BE-BIPS  is  a  complete  weighted 
digraph,  with  the  weight  of  an  edge  being  the  probability  of  its  existence. 

3.4.2  Stationary  Bootstrap 

There  are  different  resampling  schemes  to  draw  bootstrap  samples  from  the  original  data  in  Step  2. 
If  the  observations  are  independent  and  identically  distributed,  we  can  resample  the  data  randomly 
with  replacement  [44].  When  the  observations  are  time  series,  the  problem  are  more  complicated, 
since  the  observations  are  largely  dependent  on  each  other,  and  we  would  like  to  keep  this  depen¬ 
dent  information  when  doing  bootstrap.  To  preserve  the  temporal  dependent  structure  of  the  data, 
techniques  such  as  resampling  blocks  of  consecutive  observations  or  resampling  “blocks  of  blocks” 
are  proposed  for  bootstrapping  time  series  [45].  The  basic  idea  of  block  bootstrap  methods  is  that, 
though  individual  observations  may  be  dependent,  blocks  of  observations  can  be  approximately 
independent  with  each  other  given  a  proper  block  size  l. 

When  the  time  series  is  stationary,  it  is  natural  to  require  the  pseudo  time  series  obtained  by 
the  resampling  scheme  to  be  also  stationary.  The  stationary  bootstrap  [24]  is  a  bootstrap  method 
with  this  property.  It  is  based  on  resampling  blocks  of  random  length,  where  the  length  of  each 
block  follows  a  geometric  distribution  with  mean  1/r.  There  is  a  simple  method  to  conduct  such 
resampling.  Given  that  x*  is  chosen  to  be  the  ./ th  observation  xj  in  the  original  time  series,  we 
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choose  x*i+l  based  on  the  following  rule: 


xi+ 1  is 


chosen  to  be  xj,  with  probability  1  —  r 
picked  randomly  from  {xt}^=1,  with  probability  r. 


Similarly  with  block  bootstrap,  where  the  block  size  l  has  to  be  determined,  the  value  of  r 
should  be  chosen  properly.  Good  news  is  that  the  sensitivity  of  r  in  stationary  bootstrap  is  less 
than  that  of  l  in  block  bootstrap. 


3.5  Experiment 

3.5.1  Performance  Measures 

To  examine  the  performance  of  the  proposed  methods,  we  define  the  following  measures. 

Violation  Percentage  ( Pvio ):  In  T  repeated  experiments,  if  there  are  T,mo  experiments  in  which 
the  estimate  A  violates  the  stationary  condition  (80),  then  the  violation  Percentage  is  defined  as 

Pvio  =  Tvio/T- 

Miss  Probability  ( Pmiss ):  If  al3  ^  0,  al3  =  0,  we  say  there  is  a  miss.  Denote  Cmiss  as  the 
total  number  of  misses  and  let  Cnz  be  the  number  of  nonzero  entries  in  A.  The  Miss  Probability  is 
defined  as  Pmiss  Gmissj Gnz- 

False  Alarm  Probability  (P/a):  If  o.t]  =  0,  al3  ^  0,  we  say  there  is  a  false  alarm.  Denote  Cja 
as  the  total  number  of  false  alarms  and  let  Cz  be  the  number  of  zero  entries  in  A.  The  false  alarm 
Probability  is  defined  as  Pfa  =  Cfa/Cz. 

Prediction  Error  (prdErr):  The  prediction  error  is  defined  as  prdErr  =  1 1 Y^st  - Xjfst A 1 1 F /ntest , 
where  YtJst  and  Xjest  is  the  testing  data,  and  ntest  is  the  length  of  the  testing  data. 

Running  Time  (runTime):  The  averaged  running  time  of  an  algorithm.  All  the  algorithms  are 
run  in  MATLAB7. 1 1 .0  on  a  PC  with  4GB  memory. 


3.5.2  Experiment  Settings 

We  generate  the  p  x  p  autoregressive  coefficient  matrix  A  with  both  sparsity  and  stationarity  prop¬ 
erties.  First,  the  topology  is  generated  from  a  directed  random  graph  G(p ,  £),  where  the  edge 
from  one  node  to  another  node  occurs  independently  with  probability  £.  Then,  the  strength  of  the 
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edges  is  generated  independently  from  a  Gaussian  distribution.  This  process  is  repeated  until  we 
obtain  a  matrix  A  that  has  the  desired  spectral  radius  p(A).  For  all  the  experiment  result  shown, 

f  =  0.05,  p(A)  =  0.99. 

For  a  A-path,  we  use  a  grid  of  100  values  for  A,  which  is  picked  from  the  interval  [0,  ||A°  + 
XYt  —  XXTA° Hoc],  The  initial  estimate  is  simply  set  as  A0  =  0.  For  a  77-path,  we  use  a  grid  of 
76  values  for  77,  which  is  picked  from  the  interval  [2~10,  25].  The  number  of  folds  for  SCV  is  set  to 
be  K  =  5.  We  examine  the  experiment  results  for  different  combination  of  network  size  p,  sample 
size  n,  and  noise  level  a. 

All  the  statistics  we  collect  are  values  averaged  over  100  repeated  experiments. 

3.5.3  Performance  of  BIPS 

Table  3.5.3  compares  the  performance  of  Lasso,  Berhu,  and  BIPS.  Here,  by  Lasso  and  Berhu, 
we  mean  the  penalized  linear  regression  estimates  given  by  the  L1  penalty  and  the  Berhu  penalty, 
respectively. 

Comparing  the  Pvio' s,  we  see  that  it  is  possible  for  the  penalized  linear  regression,  no  matter 
Lasso  or  Berhu  is  used  for  penalty,  to  give  a  nonstationary  estimate  for  a  stationary  model.  While 
the  proposed  BIPS  algorithm  can  guarantee  the  stationarity  property  of  A.  Therefore,  adding  the 
stationarity  constraint  into  the  sparsity  pursuit  does  effectively  prevent  the  estimate  from  becom¬ 
ing  nonstationary.  Moreover,  Berhu  outperforms  Lasso  in  both  selection  accuracy  and  estimation 
accuracy,  which  proves  the  advantages  of  Berhu  over  Lasso.  When  p  >  n,  there  is  definitely  high 
collinearity  in  the  data  matrix  A".  In  this  situation,  Berhu  can  do  better  than  Lasso  thanks  to  the  L2 
term  in  the  penalty  function. 

To  further  illustrate  the  disadvantages  of  a  nonstationary  estimate,  we  examine  its  dynamic 
behavior  by  looking  at  the  sample  paths.  For  better  illustration,  we  choose  a  small  network  size 
p  —  20.  And  n  =  100,  a  =  1.  From  the  repeated  experiment,  we  find  one  run  where  Lasso  gives  a 
nonstationary  estimate  ALasso.  Starting  from  a  time  point,  we  observe  a  sample  path  from  the  true 
model  for  100  time  points.  Then  we  start  from  the  same  initial  state  and  calculate  the  sample  paths 
for  the  next  100  time  points  using  both  ALasso  and  ABIPS.  The  sample  path  of  the  true  model  and 
those  of  the  two  estimated  models  are  plotted  in  Figure  12.  We  can  easily  see  that  ABIPS  gives 
a  reasonable  imitation  of  the  true  system.  However,  the  nonstationary  estimate  ALasso  blows  up 
quickly  and  behaves  completely  different  from  the  true  model.  This  tells  us  that  guaranteeing  a 
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p/n/ a 

algorithm 

P  . 

1  miss 

Pfa 

prdErr 

P  - 

1  VIO 

Lasso 

0.213 

0.139 

2.166 

0.02 

100/50/1 

Berhu 

0.185 

0.177 

2.079 

0.04 

BIPS 

0.186 

0.175 

2.075 

0 

Lasso 

0.211 

0.135 

19.244 

0.01 

100/50/10 

Berhu 

0.188 

0.170 

19.224 

0.02 

BIPS 

0.188 

0.170 

19.223 

0 

Lasso 

0.177 

0.189 

1.656 

0.01 

100/80/1 

Berhu 

0.157 

0.122 

1.553 

0.01 

BIPS 

0.157 

0.121 

1.553 

0 

Lasso 

0.186 

0.194 

16.912 

0.02 

100/80/10 

Berhu 

0.169 

0.126 

15.639 

0.03 

BIPS 

0.170 

0.124 

15.623 

0 

Table  2:  Performance  comparison  of  Lasso,  Berhu,  and  BIPS 


p/n=300/80 

p/n=400/80 

p/n=500/80 

TIS 

0.229 

0.308 

0.377 

SIS 

0.491 

0.537 

0.579 

Table  3:  Pmiss  of  TIS  and  SIS 


stationary  estimate  is  indeed  crucial. 


3.5.4  Performance  of  TIS 

To  examine  TIS’s  performance  for  dimensionality  reduction  and  variable  selection,  we  first  com¬ 
pare  it  with  the  SIS  method  by  looking  at  the  PmiSS  of  the  two  methods.  That  is,  to  the  same  set 
of  data,  we  apply  SIS  and  TIS  separately  and  compare  the  patterns  obtained  by  them  with  the  true 
topology.  We  let  /i  =  0.9  in  the  experiment.  Table  3.5.4  shows  the  result  under  different  (p,n) 
combinations.  We  can  see  that  TIS  has  a  much  smaller  PmiSS  than  SIS. 

Now  we  run  BIPS  with  and  without  TIS  and  check  the  difference  of  the  performance.  We  set  the 
noise  level  a  —  1,  and  compare  the  two  algorithms  under  different  p/n  ratios.  From  table  3.5.4, 
we  can  see  that,  when  p/n  ratio  is  large  enough,  adding  TIS  not  only  improves  the  estimation 
accuracy,  but  also  saves  a  lot  of  time.  As  the  ratio  becomes  larger,  the  improvement  becomes  more 
significant.  On  the  other  hand,  when  p/n  ratio  is  too  large,  even  TIS  fails  to  be  satisfactory.  And 
this  is  an  important  motivation  for  using  the  bootstrap  enhanced  learning  method. 
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Figure  12:  Comparison  of  Patterns  and  Sample  Paths.  Top:  pattern  of  A  and  observations  from  the 
true  model;  Middle:  pattern  of  ABIPg  and  sample  path  from  the  corresponding  stationary  model; 
Bottom:  pattern  of  APLasso  and  sample  path  from  the  corresponding  nonstationary  model. 

3.6  Application  to  U.S.  Macroeconomic  Data 

We  apply  the  proposed  learning  framework  to  the  U.S.  macroeconomic  data.  The  dataset  con¬ 
sists  of  quarterly  observations  on  108  macroeconomic  variables  from  1960:1  to  2008:1V,  which 
belong  to  12  categories.  The  dataset  with  detailed  description  can  be  found  on  "http  :  /  /www . 
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p/n 

algorithm 

P  . 

1  miss 

Pfa 

prdErr 

runTime 

300/80 

TIS+BIPS 

0.288 

0.072 

0.985 

1216 

BIPS 

0.284 

0.073 

0.987 

1363 

400/80 

TIS+BIPS 

0.379 

0.058 

1.234 

1153 

BIPS 

0.423 

0.050 

1.321 

3091 

500/80 

TIS+BIPS 

0.473 

0.045 

1.601 

1747 

BIPS 

0.529 

0.040 

1.727 

5457 

Table  4:  Performance  of  TIS 


princeton .  edu/  ~mwat  son/wp  .  html " .  The  data  has  been  preprocessed  so  that  each  time 
series  is  a  stationary  process.  The  stationary  and  sparse  VAR  model  can  be  used  to  identify  the 
Granger  causal  relationships  between  different  variables. 

3.6.1  Comparison  of  Rolling  MSE 

First,  we  use  Rolling  forecast  to  compare  the  performance  of  Lasso  and  BIPS.  Rolling  forecasting 
procedure  [46]  is  commonly  used  in  macro  forecasting.  Define  the  rolling  window  size  to  be  w. 
Standing  at  time  point  to,  apply  the  estimation  algorithm  to  the  most  recent  w  observations  in  the 
past,  i.e.,  observations  from  time  t0  —  w  +  1  to  t0:  {xt}tLt0-w+ 1-  Then  use  this  estimated  model 
to  forecast  xt+h.  This  forecasting  procedure  is  repeated  as  the  rolling  window  slides  from  the 
beginning  of  the  time  series  to  the  end.  Denote  the  forecast  for  xt+h  as  xt+h-  The  rolling  MSE  is 

defined  as  MSErollin°  =  n_hl_w+l  Et=w  II xt+h  ~  xt+h\\l 

We  first  study  the  dataset  by  category.  To  each  of  the  12  categories,  we  apply  Lasso  and 
BIPS  respectively  with  h  —  l,w  —  0.8  x  d,  where  d  is  the  number  of  time  series.  Table  3.6.1 
shows  the  rolling  MSE  of  Lasso  and  BIPS,  normalized  by  that  of  the  AR(4)  benchmark,  which  is 
a  conventional  benchmark  of  macro  forecasting.  We  use  the  same  window  size  for  AR(4)  model 
as  the  VAR  model. 

Compared  with  the  AR(4)  model,  both  Lasso  and  BIPS,  which  solve  a  VAR  model,  have  ob¬ 
tained  a  much  smaller  forecasting  error,  except  for  Category  7.  The  reason  is  that,  by  introducing 
the  Granger  causal  interactions  between  different  indices,  the  VAR  model  becomes  more  powerful 
than  the  univariate  AR  model  in  modeling  and  forecasting,  given  the  same  amount  of  observa¬ 
tions.  The  exception  of  Category  7  may  be  due  to  the  higher  order  of  the  univariate  AR  model. 
Or  maybe  the  Granger  causal  relationships  among  Category  7  are  weak.  The  variables  are  more 
self-regulated. 
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Category 

1 

2 

3 

4 

5 

6 

Lasso 

0.589 

0.846 

0.936 

0.289 

0.071 

0.506 

BIPS 

0.445 

0.576 

0.711 

0.165 

0.033 

0.217 

Category 

7 

8 

9 

10 

11 

12 

Lasso 

1.971 

0.552 

1.443 

0.114 

0.370 

0.254 

BIPS 

1.874 

0.207 

0.738 

0.065 

0.107 

0.100 

Table  5:  Normalized  Rolling  MSE  of  Lasso  and  BIPS  for  each  category 


h 

1 

2 

4 

8 

16 

32 

Lasso 

0.0174 

0.0203 

0.0292 

0.3647 

329.9375 

3.1  x  108 

BIPS 

0.0171 

0.0184 

0.0187 

0.0197 

0.0198 

0.0176 

Table  6:  Rolling  MSE  of  Lasso  and  BIPS  for  different  horizons 


Moreover,  we  note  that  BIPS  gives  smaller  forecasting  errors  than  Lasso  for  all  the  12  cate¬ 
gories  of  macro  time  series.  It  indicates  that,  by  adding  a  stationarity  constraint,  we  are  able  to 
capture  the  network  dynamics  more  accurately  and  achieve  a  stronger  capability  of  forecasting.  To 
further  support  this  conclusion,  we  apply  Lasso  and  BIPS  respectively  to  the  whole  dataset,  with 
w  =  0.8  x  d  and  different  horizons  h.  The  rolling  MSE  for  h  =  1,  2, 4,  8, 16,  32  is  recorded  in  Ta¬ 
ble  3.6.1.  As  the  horizon  increases,  the  rolling  MSE  of  Lasso  grows  exponentially,  which  indicates 
that  some  estimates  of  Lasso  are  nonstationary  and  therefore  completely  fail  to  forecast  for  large 
horizons.  On  the  other  hand,  thanks  to  the  stationarity  constraint,  the  rolling  MSE  of  BIPS  stays  at 
the  same  order  of  magnitude.  This  phenomenon  is  similar  with  what  is  shown  by  Ligure  12.  They 
have  illustrated  the  fundamental  difference  of  BIPS  from  the  original  penalized  linear  regression 
in  forecasting  capability. 


3.6.2  Bootstrap  Analysis 

We  now  choose  80  indices  for  further  analysis  using  stationary  bootstrap.  We  apply  the  BE-BIPS 
to  these  times  series  and  analyze  their  Granger  causal  connections.  Since  the  economic  structure 
of  U.S.  has  gone  through  a  big  change  in  the  “Great  Moderation”  in  mid- 1980  [47],  we  expect  to 
see  significantly  different  topologies  for  the  macroeconomic  network  before  and  after  mid- 1980. 
Hence,  we  divide  the  time  series  into  two  periods,  the  pre-Great  Moderation  period  and  the  post- 
Great  Moderation  period,  and  apply  BE-BIPS  separately  to  the  two  periods. 

For  the  pre-Great  Moderation  period,  we  choose  80  observations  which  are  from  1960:1  to 
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Figure  13:  Topology  of  the  macroeconomic  network  in  the  pre-Great  Moderation  period. 
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Figure  14:  Topology  of  the  macroeconomic  network  in  the  post-Great  Moderation  period. 

1979:IV.  For  the  post-Great  Moderation  period,  we  choose  80  observations  which  are  from  1985:1 
to  2004:IV.  The  time  series  are  resampled  using  the  R  function  tsboot  with  default  parameter  values 
[48].  The  number  of  bootstrap  samples  is  set  to  be  B  =  100.  The  threshold  for  the  connection 
existence  probability  is  chosen  to  be  e*  =  80%.  The  topologies  we  obtained  for  the  pre-Great 
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Moderation  period  and  the  post-Great  Moderation  period  are  shown  in  Figure  13  and  Figure  14, 
respectively.  In  the  figures,  only  the  indices  that  have  connection(s)  with  others  are  plotted. 

After  Great  Moderation,  the  business  cycle  fluctuations  have  gone  through  great  reduction 
in  volatility.  And  this  reduction  has  been  clearly  reflected  by  the  topology  change.  In  the  pre- 
Great  Moderation  period,  the  macro  variables  actively  interacted  with  each  other,  which  forms  a 
very  complex  dynamic  network.  While  after  Great  Moderation,  the  interactions  have  remarkably 
reduced,  making  it  easier  for  the  network  to  stay  stable. 

3.7  Conclusion 

In  this  paper,  we  study  the  estimation  of  the  first  order  stationary  and  sparse  vector  autoregressive 
(VAR)  model.  Distinguished  from  the  existing  related  work,  we  focus  on  the  stationarity  property 
of  VAR  and  the  “large  p  small  n”  scenario.  To  solve  the  nonstationarity  problem  of  the  penal¬ 
ized  linear  regression  estimation,  we  consider  adding  a  stationarity  constraint,  and  propose  the 
BIPS  algorithm  to  give  an  efficient  solution.  For  “large  p  small  n”  problem,  we  implement  the 
thresholding-based  iterative  screening  into  the  BIPS  algorithm  to  improve  the  identification  accu¬ 
racy  and  computation  efficiency.  The  bootstrap  enhanced  BIPS  is  proposed  to  provide  a  confidence 
measure  for  each  possible  connection  in  the  network.  The  VAR  model  is  widely  used  in  many  re¬ 
search  domains  for  time  series  analysis  and  network  identification.  In  this  paper,  we  apply  it  to 
the  analysis  of  the  U.S.  macroeconomic  data  and  make  some  interesting  discovery.  The  proposed 
framework  can  also  be  used  to  infer  the  brain  function  connectivity  through  fMRI  data  and  the 
gene  regulatory  network  through  microarray  data,  which  will  be  left  for  future  work. 

4  Conclusions 

In  this  project,  we  have  accomplished  the  research  objective  of  developing  transformative  theory 
and  algorithms  for  robust  Automated  Target  Recognition  (ATR).  Specifically,  we  have  developed 
the  following  new  techniques: 

•  kernel  local  feature  extraction  (KLFE)  for  ATR  applications, 

•  technique  for  identifying  network  dynamics  under  sparsity  and  stationarity  constraints, 

•  self-organized-queue-based  (SOQ)  clustering  scheme, 
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•  robust  principal  component  analysis  (RPCA)  based  on  manifold  optimization,  outlier  detec¬ 
tion,  and  subspace  decomposition. 
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